6. Embedding Models

Embedding models transform your raw data into the vector representations that are loaded into your target vector database. VectorETL supports several popular embedding models, so you can choose the one that best fits your needs.
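
Whichever model you choose, the resulting embedding block is passed to the pipeline alongside your source and target configurations. A minimal sketch, assuming the create_flow() interface exposed by the vector_etl package and that source and target configuration dicts are defined elsewhere in your pipeline:

from vector_etl import create_flow

embedding = {
    "embedding_model": "OpenAI",
    "api_key": "your-openai-api-key",
    "model_name": "text-embedding-ada-002"
}

flow = create_flow()
flow.set_source(source)        # source configuration dict
flow.set_embedding(embedding)  # any of the configurations in this section
flow.set_target(target)        # target configuration dict
flow.execute()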

OpenAI

Python variable (as JSON)

{
    "embedding_model": "OpenAI",
    "api_key": "your-openai-api-key",
    "model_name": "text-embedding-ada-002"
}

YAML

embedding:
  embedding_model: "OpenAI"
  api_key: "your-openai-api-key"
  model_name: "text-embedding-ada-002"

Cohere

Python variable (as JSON)

{
    "embedding_model": "Cohere",
    "api_key": "your-cohere-api-key",
    "model_name": "embed-english-v2.0"
}

YAML

embedding:
  embedding_model: "Cohere"
  api_key: "your-cohere-api-key"
  model_name: "embed-english-v2.0"

Google Gemini

Python variable (as JSON)

{
    "embedding_model": "Google Gemini",
    "api_key": "your-gemini-api-key",
    "model_name": "embedding-001"
}

YAML

embedding:
  embedding_model: "Google Gemini"
  api_key: "your-gemini-api-key"
  model_name: "embedding-001"

Azure OpenAI

Python variable (as JSON)

{
    "embedding_model": "Azure OpenAI",
    "api_key": "your-azure-openai-api-key",
    "endpoint": "your-azure-openai-endpoint",
    "version": "2022-12-01",
    "model_name": "text-embedding-ada-002",
    "private_deployment": "Yes",
    "deployment_name": "your-deployment-name"
}

YAML

embedding:
  embedding_model: "Azure OpenAI"
  api_key: "your-azure-openai-api-key"
  endpoint: "your-azure-openai-endpoint"
  version: "2022-12-01"  # API version
  model_name: "text-embedding-ada-002"
  private_deployment: "Yes"  # or "No"
  deployment_name: "your-deployment-name"  # if private_deployment is "Yes"

Hugging Face

Python variable (as JSON)

{
    "embedding_model": "Hugging Face",
    "api_key": "your-huggingface-api-key",
    "model_name": "sentence-transformers/all-MiniLM-L6-v2"
}

YAML

embedding:
  embedding_model: "Hugging Face"
  api_key: "your-huggingface-api-key"
  model_name: "sentence-transformers/all-MiniLM-L6-v2"

Choosing the Right Embedding Model

When selecting an embedding model, consider the following factors:

  • Language support: Ensure the model supports the languages in your data.

  • Embedding dimension: Different models produce embeddings of different sizes, and your target vector database index must be created with a matching dimension (see the check sketched after this list).

  • Licensing and cost: Be aware of usage limits and pricing for API-based models.

  • Performance: Consider the trade-off between embedding quality and computation time/cost.
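
API-based models document their output dimension (text-embedding-ada-002, for example, returns 1536-dimensional vectors). For open-source models you can verify the dimension locally before provisioning the target index; a minimal sketch, assuming the sentence-transformers package is installed:

from sentence_transformers import SentenceTransformer

# Load the same model referenced in the Hugging Face configuration above
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Prints 384 for this model; create the target index with this dimension
print(model.get_sentence_embedding_dimension())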

Adding Custom Embedding Models

To add a custom embedding model:

  1. Create a new file in the embedding_mods directory.

  2. Implement a new class that inherits from BaseEmbedding.

  3. Implement the required embed() method.

  4. Update the get_embedding_model() function in embedding_mods/__init__.py to include your new model (a registration sketch follows the class example below).

Example of a custom embedding class:

from .base import BaseEmbedding

class MyCustomEmbedding(BaseEmbedding):
    def __init__(self, config):
        self.config = config
        # Initialize your model here

    def embed(self, df, embed_column='__concat_final'):
        # Implement embedding logic here
        # Return the dataframe with a new 'embeddings' column
        return df
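
For step 4, the registration could look like the sketch below. The module name my_custom_embedding and the "MyCustomEmbedding" identifier are placeholders; merge the new branch into the existing dispatch logic in embedding_mods/__init__.py rather than replacing it:

from .my_custom_embedding import MyCustomEmbedding  # placeholder module name

def get_embedding_model(config):
    # Dispatch on the embedding_model field of the configuration
    if config["embedding_model"] == "MyCustomEmbedding":
        return MyCustomEmbedding(config)
    # ... existing branches for OpenAI, Cohere, Google Gemini,
    # Azure OpenAI, and Hugging Face remain unchanged ...
    raise ValueError(f"Unsupported embedding model: {config['embedding_model']}")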