9. API Reference

VectorETL exposes a small set of core classes and functions for building and customizing ETL pipelines. This section outlines the main components of the API.

Source Classes

All source classes inherit from the BaseSource class:

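from abc import ABC, abstractmethod

class BaseSource(ABC):
    @abstractmethod
    def connect(self):
        # Establish the connection to the data source.
        pass

    @abstractmethod
    def fetch_data(self):
        # Retrieve the data from the connected source.
        pass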

Key source classes include:

  • S3Source: For Amazon S3 buckets

  • DatabaseSource: For SQL databases

  • LocalFileSource: For local file systems

  • DropboxSource: For Dropbox files

  • GoogleDriveSource: For Google Drive files
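To illustrate the interface, here is a minimal sketch of a custom source. BaseSource is redefined locally so the snippet runs on its own; InMemorySource, its rows argument, and the check that connect() is called first are hypothetical and are not part of VectorETL.

```python
from abc import ABC, abstractmethod


class BaseSource(ABC):
    """Local stand-in mirroring the interface shown above."""

    @abstractmethod
    def connect(self):
        pass

    @abstractmethod
    def fetch_data(self):
        pass


class InMemorySource(BaseSource):
    """Hypothetical source that serves rows from a Python list."""

    def __init__(self, rows):
        self.rows = rows
        self.connected = False

    def connect(self):
        # Nothing to connect to; just record that connect() was called.
        self.connected = True

    def fetch_data(self):
        if not self.connected:
            raise RuntimeError("call connect() before fetch_data()")
        # Return a copy so callers cannot mutate the stored rows.
        return list(self.rows)


source = InMemorySource([{"id": 1, "text": "hello"}, {"id": 2, "text": "world"}])
source.connect()
rows = source.fetch_data()
print(rows)
```

Because both methods are abstract, a subclass that omits either one cannot be instantiated, which is how the base class enforces the contract.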

Embedding Classes

Embedding classes inherit from the BaseEmbedding class:

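from abc import ABC, abstractmethod

class BaseEmbedding(ABC):
    @abstractmethod
    def embed(self, df, embed_column='__concat_final'):
        # embed_column names the column of df whose content is embedded.
        pass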

Key embedding classes include:

  • OpenAIEmbedding: For OpenAI’s embedding models

  • CohereEmbedding: For Cohere’s embedding models

  • GoogleGeminiEmbedding: For Google’s Gemini models

  • AzureOpenAIEmbedding: For Azure OpenAI Service

  • HuggingFaceEmbedding: For Hugging Face models
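A custom embedding class only needs to implement embed(). The sketch below is a toy, assuming a list of dicts in place of the data frame that the df parameter suggests, so that it runs without third-party packages. CharCountEmbedding, its alphabet argument, and the "embeddings" output key are hypothetical names chosen for this example.

```python
from abc import ABC, abstractmethod


class BaseEmbedding(ABC):
    """Local stand-in mirroring the interface shown above."""

    @abstractmethod
    def embed(self, df, embed_column='__concat_final'):
        pass


class CharCountEmbedding(BaseEmbedding):
    """Hypothetical toy model: one vector slot per tracked character."""

    def __init__(self, alphabet="aeiou"):
        self.alphabet = alphabet

    def embed(self, df, embed_column='__concat_final'):
        # `df` is a list of dicts in this sketch. Each output row is a copy
        # of the input row with an added 'embeddings' key (assumed name).
        out = []
        for row in df:
            text = str(row[embed_column]).lower()
            vector = [float(text.count(ch)) for ch in self.alphabet]
            out.append({**row, "embeddings": vector})
        return out


records = [{"__concat_final": "Vector ETL"}, {"__concat_final": "banana"}]
embedded = CharCountEmbedding().embed(records)
for row in embedded:
    print(row["__concat_final"], row["embeddings"])
```

A real implementation would call a model API here instead of counting characters; the point is only the shape of the method.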

Target Classes

Target classes inherit from the BaseTarget class:

from abc import ABC, abstractmethod

class BaseTarget(ABC):
    @abstractmethod
    def connect(self):
        pass

    @abstractmethod
    def create_index_if_not_exists(self, dimension):
        pass

    @abstractmethod
    def write_data(self, df, columns, domain=None):
        pass

Key target classes include:

  • PineconeTarget: For Pinecone vector database

  • QdrantTarget: For Qdrant vector database

  • WeaviateTarget: For Weaviate vector database

  • SingleStoreTarget: For SingleStore database

  • SupabaseTarget: For Supabase vector storage
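The three target methods can be exercised with an in-memory stand-in. This is a sketch under stated assumptions: InMemoryTarget is hypothetical, rows are plain dicts carrying an "embeddings" key (an assumed name), columns is treated as the list of fields to keep as metadata, and domain is accepted but ignored because this section does not define it.

```python
from abc import ABC, abstractmethod


class BaseTarget(ABC):
    """Local stand-in mirroring the interface shown above."""

    @abstractmethod
    def connect(self):
        pass

    @abstractmethod
    def create_index_if_not_exists(self, dimension):
        pass

    @abstractmethod
    def write_data(self, df, columns, domain=None):
        pass


class InMemoryTarget(BaseTarget):
    """Hypothetical target that keeps records in a Python list."""

    def __init__(self):
        self.connected = False
        self.dimension = None
        self.records = []

    def connect(self):
        self.connected = True

    def create_index_if_not_exists(self, dimension):
        # Only the first call fixes the dimension; later calls are no-ops.
        if self.dimension is None:
            self.dimension = dimension

    def write_data(self, df, columns, domain=None):
        # `domain` is accepted to match the signature but ignored here.
        for row in df:
            vector = row["embeddings"]
            if len(vector) != self.dimension:
                raise ValueError("vector length does not match index dimension")
            metadata = {name: row[name] for name in columns}
            self.records.append({"vector": vector, "metadata": metadata})


target = InMemoryTarget()
target.connect()
target.create_index_if_not_exists(dimension=2)
target.create_index_if_not_exists(dimension=5)  # no-op: index already exists
target.write_data(
    [{"id": 1, "title": "a", "embeddings": [0.1, 0.2]}],
    columns=["id", "title"],
)
print(target.records)
```

Checking the vector length against the index dimension at write time mirrors what vector databases generally enforce, and it surfaces a model/index mismatch early.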

Utility Functions

VectorETL includes several utility functions to help with common tasks:

  • get_source_class(config): Returns the appropriate source class based on configuration

  • get_embedding_model(config): Returns the appropriate embedding class based on configuration

  • get_target_database(config): Returns the appropriate target class based on configuration
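Lookups of this kind are often written as a registry that maps a configuration value to a class. The following is a sketch of that general pattern, not VectorETL's actual implementation: lookup_class, SOURCE_REGISTRY, the "type" config key, and the registry keys are all hypothetical, and the two classes are empty placeholders.

```python
class S3Source:
    """Empty placeholder standing in for a real source class."""


class LocalFileSource:
    """Empty placeholder standing in for a real source class."""


# Hypothetical registry: config value -> class.
SOURCE_REGISTRY = {
    "s3": S3Source,
    "local_file": LocalFileSource,
}


def lookup_class(config, registry, key="type"):
    """Return the class registered under config[key], or raise ValueError."""
    name = config.get(key)
    try:
        return registry[name]
    except KeyError:
        supported = ", ".join(sorted(registry))
        raise ValueError(
            f"Unsupported {key} {name!r}; expected one of: {supported}"
        ) from None


source_cls = lookup_class({"type": "s3"}, SOURCE_REGISTRY)
print(source_cls.__name__)
```

Failing with a message that lists the supported values makes a misspelled configuration entry easy to diagnose.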

Orchestrator

The ETLOrchestrator class coordinates the entire ETL process:

class ETLOrchestrator:
    def __init__(self, source_config, embedding_config, target_config, embed_columns):
        """Initialize components."""

    def run(self):
        """Run the ETL process."""

    def fetch_data(self):
        """Fetch data from source."""

    def process_and_embed_data(self, df):
        """Process and embed data."""

    def write_to_target(self, df):
        """Write data to target database."""
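The way these methods fit together can be sketched with a toy orchestrator. Several things here are assumptions rather than documented behavior: MiniOrchestrator takes ready-made components instead of config dictionaries, rows are plain dicts, the embed columns are joined with a space into '__concat_final' (inferred from the default embed_column in BaseEmbedding), and the "embeddings" key is a name chosen for this example. All four classes are hypothetical.

```python
class ListSource:
    def __init__(self, rows):
        self.rows = rows

    def connect(self):
        pass

    def fetch_data(self):
        return list(self.rows)


class LengthEmbedding:
    def embed(self, df, embed_column="__concat_final"):
        # Toy one-dimensional "embedding": the length of the text.
        return [{**r, "embeddings": [float(len(r[embed_column]))]} for r in df]


class ListTarget:
    def __init__(self):
        self.dimension = None
        self.written = []

    def connect(self):
        pass

    def create_index_if_not_exists(self, dimension):
        self.dimension = dimension

    def write_data(self, df, columns, domain=None):
        self.written.extend(df)


class MiniOrchestrator:
    def __init__(self, source, embedding, target, embed_columns):
        self.source = source
        self.embedding = embedding
        self.target = target
        self.embed_columns = embed_columns

    def run(self):
        # Extract, then transform (embed), then load.
        df = self.fetch_data()
        df = self.process_and_embed_data(df)
        self.write_to_target(df)

    def fetch_data(self):
        self.source.connect()
        return self.source.fetch_data()

    def process_and_embed_data(self, df):
        # Assumed step: join the chosen columns into '__concat_final'.
        rows = [
            {**r, "__concat_final": " ".join(str(r[c]) for c in self.embed_columns)}
            for r in df
        ]
        return self.embedding.embed(rows)

    def write_to_target(self, df):
        self.target.connect()
        if df:
            self.target.create_index_if_not_exists(len(df[0]["embeddings"]))
        self.target.write_data(df, columns=self.embed_columns)


target = ListTarget()
MiniOrchestrator(
    ListSource([{"name": "Ada", "role": "engineer"}]),
    LengthEmbedding(),
    target,
    embed_columns=["name", "role"],
).run()
print(target.written)
```

Keeping each stage in its own method, as the ETLOrchestrator outline does, lets a caller run or test one stage at a time.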