9. API Reference
VectorETL exposes a set of core classes and functions for building and customizing your ETL pipelines. This section gives an overview of the main components of the API.
Source Classes
All source classes inherit from the BaseSource class:
    from abc import ABC, abstractmethod

    class BaseSource(ABC):
        @abstractmethod
        def connect(self):
            pass

        @abstractmethod
        def fetch_data(self):
            pass
Key source classes include:
- S3Source: For Amazon S3 buckets
- DatabaseSource: For SQL databases
- LocalFileSource: For local file systems
- DropboxSource: For Dropbox files
- GoogleDriveSource: For Google Drive files
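To illustrate how the two abstract methods fit together, here is a minimal sketch of a custom source. It is not part of VectorETL: `InMemoryCsvSource` and its `csv_text` config key are hypothetical, `BaseSource` is redeclared locally so the snippet runs on its own, and the return value is a plain dict of column lists standing in for whatever tabular object the real sources return.

```python
from abc import ABC, abstractmethod
import csv
import io


class BaseSource(ABC):
    """Local copy of the interface shown above, so the sketch is self-contained."""

    @abstractmethod
    def connect(self):
        pass

    @abstractmethod
    def fetch_data(self):
        pass


class InMemoryCsvSource(BaseSource):
    """Hypothetical source that reads rows from a CSV string held in the config."""

    def __init__(self, config):
        self.config = config
        self._buffer = None

    def connect(self):
        # "Connecting" here only means opening a text buffer over the CSV text.
        self._buffer = io.StringIO(self.config["csv_text"])

    def fetch_data(self):
        # Connect lazily if the caller has not done so yet.
        if self._buffer is None:
            self.connect()
        self._buffer.seek(0)
        reader = csv.DictReader(self._buffer)
        rows = list(reader)
        # Return a column-oriented table: {column name: list of values}.
        return {name: [row[name] for row in rows] for name in reader.fieldnames}


source = InMemoryCsvSource({"csv_text": "id,text\n1,hello\n2,world\n"})
source.connect()
columns = source.fetch_data()
```

A subclass that omits either `connect` or `fetch_data` cannot be instantiated, which is the point of declaring them with `@abstractmethod`.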
Embedding Classes
Embedding classes inherit from the BaseEmbedding class:
    class BaseEmbedding(ABC):
        @abstractmethod
        def embed(self, df, embed_column='__concat_final'):
            pass
Key embedding classes include:
- OpenAIEmbedding: For OpenAI’s embedding models
- CohereEmbedding: For Cohere’s embedding models
- GoogleGeminiEmbedding: For Google’s Gemini models
- AzureOpenAIEmbedding: For Azure OpenAI Service
- HuggingFaceEmbedding: For Hugging Face models
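The `embed` contract can be sketched with a toy model that needs no API key. Everything below is an assumption made for illustration: `HashEmbedding` is hypothetical, the output column name `embeddings` is a guess rather than a documented name, and `df` is a plain dict of column lists used as a dependency-free stand-in for a DataFrame. The vectors are hash-derived and carry no semantic meaning.

```python
from abc import ABC, abstractmethod
import hashlib


class BaseEmbedding(ABC):
    """Local copy of the interface shown above, so the sketch is self-contained."""

    @abstractmethod
    def embed(self, df, embed_column='__concat_final'):
        pass


class HashEmbedding(BaseEmbedding):
    """Hypothetical toy model: deterministic pseudo-vectors from a SHA-256 digest."""

    def __init__(self, dimension=8):
        # A SHA-256 digest has 32 bytes, so that is the largest usable dimension.
        if not 1 <= dimension <= 32:
            raise ValueError("dimension must be between 1 and 32")
        self.dimension = dimension

    def _vector(self, text):
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Scale the first `dimension` bytes into the range [0, 1].
        return [byte / 255 for byte in digest[: self.dimension]]

    def embed(self, df, embed_column='__concat_final'):
        # Add one vector per row under the (assumed) "embeddings" column.
        df["embeddings"] = [self._vector(str(text)) for text in df[embed_column]]
        return df


table = {"__concat_final": ["hello world", "vector etl"]}
result = HashEmbedding(dimension=4).embed(table)
```

A real embedding class would replace `_vector` with a call to the provider's model; the shape of `embed` (read one text column, attach one vector per row) is what the sketch is meant to show.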
Target Classes
Target classes inherit from the BaseTarget class:
    class BaseTarget(ABC):
        @abstractmethod
        def connect(self):
            pass

        @abstractmethod
        def create_index_if_not_exists(self, dimension):
            pass

        @abstractmethod
        def write_data(self, df, columns, domain=None):
            pass
Key target classes include:
- PineconeTarget: For Pinecone vector database
- QdrantTarget: For Qdrant vector database
- WeaviateTarget: For Weaviate vector database
- SingleStoreTarget: For SingleStore database
- SupabaseTarget: For Supabase vector storage
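As a rough sketch of how the three target methods might cooperate, the hypothetical `InMemoryTarget` below stores records in a Python list instead of a vector database. The `embeddings` column name, the meaning of `columns` (metadata fields to keep), and the use of `domain` as a simple tag are all assumptions, and `df` is again a plain dict of column lists rather than a DataFrame.

```python
from abc import ABC, abstractmethod


class BaseTarget(ABC):
    """Local copy of the interface shown above, so the sketch is self-contained."""

    @abstractmethod
    def connect(self):
        pass

    @abstractmethod
    def create_index_if_not_exists(self, dimension):
        pass

    @abstractmethod
    def write_data(self, df, columns, domain=None):
        pass


class InMemoryTarget(BaseTarget):
    """Hypothetical target that keeps records in a list."""

    def __init__(self, config):
        self.config = config
        self.connected = False
        self.index = None
        self.dimension = None

    def connect(self):
        # Nothing to connect to; just record that the call happened.
        self.connected = True

    def create_index_if_not_exists(self, dimension):
        # Idempotent: an existing index is left untouched.
        if self.index is None:
            self.index = []
            self.dimension = dimension

    def write_data(self, df, columns, domain=None):
        for i, vector in enumerate(df["embeddings"]):
            if len(vector) != self.dimension:
                raise ValueError("vector length does not match index dimension")
            # Keep only the requested columns as metadata for this record.
            metadata = {name: df[name][i] for name in columns}
            self.index.append({"vector": vector, "metadata": metadata, "domain": domain})


target = InMemoryTarget({})
target.connect()
target.create_index_if_not_exists(dimension=2)
target.write_data(
    {"text": ["a", "b"], "embeddings": [[0.1, 0.2], [0.3, 0.4]]},
    columns=["text"],
    domain="demo",
)
```

Calling `create_index_if_not_exists` before `write_data` mirrors the order implied by the method names: the index dimension has to be known before any vectors are written.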
Utility Functions
VectorETL includes several utility functions to help with common tasks:
- get_source_class(config): Returns the appropriate source class based on configuration
- get_embedding_model(config): Returns the appropriate embedding class based on configuration
- get_target_database(config): Returns the appropriate target class based on configuration
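Functions like these are commonly built as a lookup from a configuration value to a class. The sketch below shows that pattern only; it does not reproduce VectorETL's implementation. The function name `pick_source_class`, the `type` config key, the registry keys, and the placeholder classes are all hypothetical.

```python
class S3Source:
    """Placeholder standing in for a real source class."""

    def __init__(self, config):
        self.config = config


class LocalFileSource:
    """Placeholder standing in for a real source class."""

    def __init__(self, config):
        self.config = config


# Hypothetical mapping from a config value to the class that handles it.
SOURCE_REGISTRY = {
    "s3": S3Source,
    "local_file": LocalFileSource,
}


def pick_source_class(config):
    """Return the class registered for config["type"] (an assumed key)."""
    kind = config.get("type")
    try:
        return SOURCE_REGISTRY[kind]
    except KeyError:
        raise ValueError(f"Unsupported source type: {kind!r}") from None


config = {"type": "s3", "bucket": "example-bucket"}
source_cls = pick_source_class(config)
source = source_cls(config)
```

Keeping the mapping in one dictionary means that supporting a new source is a one-line registration, and an unknown value fails early with a clear error.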
Orchestrator
The ETLOrchestrator class coordinates the entire ETL process:
    class ETLOrchestrator:
        def __init__(self, source_config, embedding_config, target_config, embed_columns):
            # Initialize components
            ...

        def run(self):
            # Run the ETL process
            ...

        def fetch_data(self):
            # Fetch data from source
            ...

        def process_and_embed_data(self, df):
            # Process and embed data
            ...

        def write_to_target(self, df):
            # Write data to target database
            ...
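The outline above suggests that `run` chains the other three methods. The following sketch shows one plausible flow under several assumptions: `SketchOrchestrator` takes ready-made component objects instead of config dicts, the stub components are hypothetical, the text of `embed_columns` is joined into a `__concat_final` column (inferred from the default argument of `embed`), and tables are plain dicts of column lists.

```python
class StubSource:
    def connect(self):
        pass

    def fetch_data(self):
        return {"title": ["a", "b"], "body": ["x", "y"]}


class StubEmbedding:
    def embed(self, df, embed_column="__concat_final"):
        # One-dimensional toy vector: the length of the text.
        df["embeddings"] = [[float(len(text))] for text in df[embed_column]]
        return df


class StubTarget:
    def __init__(self):
        self.dimension = None
        self.written = []

    def connect(self):
        pass

    def create_index_if_not_exists(self, dimension):
        self.dimension = dimension

    def write_data(self, df, columns, domain=None):
        self.written.append((df, columns))


class SketchOrchestrator:
    """Illustrative flow only; not the library's ETLOrchestrator."""

    def __init__(self, source, embedding, target, embed_columns):
        self.source = source
        self.embedding = embedding
        self.target = target
        self.embed_columns = embed_columns

    def run(self):
        df = self.fetch_data()
        df = self.process_and_embed_data(df)
        self.write_to_target(df)
        return df

    def fetch_data(self):
        self.source.connect()
        return self.source.fetch_data()

    def process_and_embed_data(self, df):
        # Join the selected columns row by row into one text column.
        selected = [df[name] for name in self.embed_columns]
        df["__concat_final"] = [" ".join(str(v) for v in row) for row in zip(*selected)]
        return self.embedding.embed(df, embed_column="__concat_final")

    def write_to_target(self, df):
        self.target.connect()
        # The index dimension is taken from the first vector.
        self.target.create_index_if_not_exists(len(df["embeddings"][0]))
        self.target.write_data(df, columns=self.embed_columns)


target = StubTarget()
orchestrator = SketchOrchestrator(StubSource(), StubEmbedding(), target, ["title", "body"])
result = orchestrator.run()
```

Passing component objects into the constructor keeps the sketch short and easy to test; the documented constructor takes configuration dicts, so the real class presumably builds its components from them.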