6. Embedding Models

Embedding models transform your raw data into the vector representations that are loaded into your target vector database. VectorETL supports several popular embedding models, so you can choose the one that best fits your needs.
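
Whichever model you choose, the resulting embedding block is passed to the pipeline alongside your source and target configurations. A minimal sketch, assuming the create_flow() interface exposed by the vector_etl package and that source and target configuration dicts are defined elsewhere in your pipeline:

from vector_etl import create_flow

embedding = {
    "embedding_model": "OpenAI",
    "api_key": "your-openai-api-key",
    "model_name": "text-embedding-ada-002"
}

flow = create_flow()
flow.set_source(source)        # source configuration dict
flow.set_embedding(embedding)  # any of the configurations in this section
flow.set_target(target)        # target configuration dict
flow.execute()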

OpenAI

Python variable (as JSON)

{
    "embedding_model": "OpenAI",
    "api_key": "your-openai-api-key",
    "model_name": "text-embedding-ada-002"
}

YAML

embedding:
  embedding_model: "OpenAI"
  api_key: "your-openai-api-key"
  model_name: "text-embedding-ada-002"

Cohere

Python variable (as JSON)

{
    "embedding_model": "Cohere",
    "api_key": "your-cohere-api-key",
    "model_name": "embed-english-v2.0"
}

YAML

embedding:
  embedding_model: "Cohere"
  api_key: "your-cohere-api-key"
  model_name: "embed-english-v2.0"

Google Gemini

Python variable (as JSON)

{
    "embedding_model": "Google Gemini",
    "api_key": "your-gemini-api-key",
    "model_name": "embedding-001"
}

YAML

embedding:
  embedding_model: "Google Gemini"
  api_key: "your-gemini-api-key"
  model_name: "embedding-001"

Azure OpenAI

Python variable (as JSON)

{
    "embedding_model": "Azure OpenAI",
    "api_key": "your-azure-openai-api-key",
    "endpoint": "your-azure-openai-endpoint",
    "version": "2022-12-01",
    "model_name": "text-embedding-ada-002",
    "private_deployment": "Yes",
    "deployment_name": "your-deployment-name"
}

YAML

embedding:
  embedding_model: "Azure OpenAI"
  api_key: "your-azure-openai-api-key"
  endpoint: "your-azure-openai-endpoint"
  version: "2022-12-01"  # API version
  model_name: "text-embedding-ada-002"
  private_deployment: "Yes"  # or "No"
  deployment_name: "your-deployment-name"  # if private_deployment is "Yes"

Hugging Face

Python variable (as JSON)

{
    "embedding_model": "Hugging Face",
    "api_key": "your-huggingface-api-key",
    "model_name": "sentence-transformers/all-MiniLM-L6-v2"
}

YAML

embedding:
  embedding_model: "Hugging Face"
  api_key: "your-huggingface-api-key"
  model_name: "sentence-transformers/all-MiniLM-L6-v2"

Choosing the Right Embedding Model

When selecting an embedding model, consider the following factors:

  • Language support: Ensure the model supports the languages in your data.

  • Embedding dimension: Different models produce embeddings of different sizes, and your target vector database index must be created with a matching dimension (see the check sketched after this list).

  • Licensing and cost: Be aware of usage limits and pricing for API-based models.

  • Performance: Consider the trade-off between embedding quality and computation time/cost.
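
API-based models document their output dimension (text-embedding-ada-002, for example, returns 1536-dimensional vectors). For open-source models you can verify the dimension locally before provisioning the target index; a minimal sketch, assuming the sentence-transformers package is installed:

from sentence_transformers import SentenceTransformer

# Load the same model referenced in the Hugging Face configuration above
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Prints 384 for this model; create the target index with this dimension
print(model.get_sentence_embedding_dimension())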

Adding Custom Embedding Models

To add a custom embedding model:

  1. Create a new file in the embedding_mods directory.

  2. Implement a new class that inherits from BaseEmbedding.

  3. Implement the required embed() method.

  4. Update the get_embedding_model() function in embedding_mods/__init__.py to include your new model (a registration sketch follows the class example below).

Example of a custom embedding class:

from .base import BaseEmbedding

class MyCustomEmbedding(BaseEmbedding):
    def __init__(self, config):
        self.config = config
        # Initialize your model here

    def embed(self, df, embed_column='__concat_final'):
        # Implement embedding logic here
        # Return the dataframe with a new 'embeddings' column
        return df
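
For step 4, the registration could look like the sketch below. The module name my_custom_embedding and the "MyCustomEmbedding" identifier are placeholders; merge the new branch into the existing dispatch logic in embedding_mods/__init__.py rather than replacing it:

from .my_custom_embedding import MyCustomEmbedding  # placeholder module name

def get_embedding_model(config):
    # Dispatch on the embedding_model field of the configuration
    if config["embedding_model"] == "MyCustomEmbedding":
        return MyCustomEmbedding(config)
    # ... existing branches for OpenAI, Cohere, Google Gemini,
    # Azure OpenAI, and Hugging Face remain unchanged ...
    raise ValueError(f"Unsupported embedding model: {config['embedding_model']}")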