6. Embedding Models
Embedding models transform your raw data into vector representations that can be stored and searched in a vector database. VectorETL supports several popular embedding models, allowing you to choose the one that best fits your needs.
OpenAI
Python variable (as JSON)
{
  "embedding_model": "OpenAI",
  "api_key": "your-openai-api-key",
  "model_name": "text-embedding-ada-002"
}
YAML
embedding:
  embedding_model: "OpenAI"
  api_key: "your-openai-api-key"
  model_name: "text-embedding-ada-002"
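The embedding block is one of the three configuration blocks a VectorETL pipeline needs, alongside a source and a target. Below is a minimal sketch of wiring it in from Python, assuming the create_flow() builder and its set_source/set_embedding/set_target/execute methods match your installed VectorETL version, and that source_config and target_config are defined elsewhere:
from vector_etl import create_flow

embedding_config = {
    "embedding_model": "OpenAI",
    "api_key": "your-openai-api-key",
    "model_name": "text-embedding-ada-002"
}

flow = create_flow()
flow.set_source(source_config)        # defined elsewhere
flow.set_embedding(embedding_config)  # the dictionary shown above
flow.set_target(target_config)        # defined elsewhere
flow.execute()
The providers below plug in the same way; only the embedding dictionary changes.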
Cohere
Python variable (as JSON)
{
  "embedding_model": "Cohere",
  "api_key": "your-cohere-api-key",
  "model_name": "embed-english-v2.0"
}
YAML
embedding:
  embedding_model: "Cohere"
  api_key: "your-cohere-api-key"
  model_name: "embed-english-v2.0"
Google Gemini
Python variable (as JSON)
{
  "embedding_model": "Google Gemini",
  "api_key": "your-gemini-api-key",
  "model_name": "embedding-001"
}
YAML
embedding:
  embedding_model: "Google Gemini"
  api_key: "your-gemini-api-key"
  model_name: "embedding-001"
Azure OpenAI
Python variable (as JSON)
{
  "embedding_model": "Azure OpenAI",
  "api_key": "your-azure-openai-api-key",
  "endpoint": "your-azure-openai-endpoint",
  "version": "2022-12-01",
  "model_name": "text-embedding-ada-002",
  "private_deployment": "Yes",
  "deployment_name": "your-deployment-name"
}
YAML
embedding:
  embedding_model: "Azure OpenAI"
  api_key: "your-azure-openai-api-key"
  endpoint: "your-azure-openai-endpoint"
  version: "2022-12-01" # API version
  model_name: "text-embedding-ada-002"
  private_deployment: "Yes" # or "No"
  deployment_name: "your-deployment-name" # required if private_deployment is "Yes"
Hugging Face
Python variable (as JSON)
{
  "embedding_model": "Hugging Face",
  "api_key": "your-huggingface-api-key",
  "model_name": "sentence-transformers/all-MiniLM-L6-v2"
}
YAML
embedding:
  embedding_model: "Hugging Face"
  api_key: "your-huggingface-api-key"
  model_name: "sentence-transformers/all-MiniLM-L6-v2"
Choosing the Right Embedding Model
When selecting an embedding model, consider the following factors:
Language support: Ensure the model supports the languages in your data.
Embedding dimension: Different models produce embeddings of different sizes, and your target vector store must be created with a matching dimension (see the sketch after this list).
Licensing and cost: Be aware of usage limits and pricing for API-based models.
Performance: Consider the trade-off between embedding quality and computation time/cost.
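To make the dimension point concrete, the sketch below compares two common defaults. The dimension values (1536 for text-embedding-ada-002, 384 for all-MiniLM-L6-v2) are properties of those models; the helper function itself is purely illustrative and not part of VectorETL:
# Illustrative only: the target collection/index must be created with the
# same dimensionality as the embedding model you select.
KNOWN_DIMENSIONS = {
    "text-embedding-ada-002": 1536,                 # OpenAI
    "sentence-transformers/all-MiniLM-L6-v2": 384,  # Hugging Face
}

def check_dimension(model_name, index_dimension):
    expected = KNOWN_DIMENSIONS.get(model_name)
    if expected is not None and expected != index_dimension:
        raise ValueError(
            f"{model_name} produces {expected}-dimensional vectors, "
            f"but the target index expects {index_dimension}"
        )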
Adding Custom Embedding Models
To add a custom embedding model:
1. Create a new file in the embedding_mods directory.
2. Implement a new class that inherits from BaseEmbedding.
3. Implement the required embed() method.
4. Update the get_embedding_model() function in embedding_mods/__init__.py to include your new model.
Example of a custom embedding class:
from .base import BaseEmbedding

class MyCustomEmbedding(BaseEmbedding):
    def __init__(self, config):
        self.config = config
        # Initialize your model or API client here (e.g. load weights
        # or read an API key from the config dictionary).

    def embed(self, df, embed_column='__concat_final'):
        # Implement your embedding logic here: encode the text in
        # df[embed_column] and return the dataframe with a new
        # 'embeddings' column containing one vector per row.
        return df
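Step 4, registering the model, depends on how your version of embedding_mods/__init__.py is written, but it usually amounts to adding one branch to the factory function. A minimal sketch, assuming get_embedding_model() dispatches on the embedding_model field of the config and that the class above lives in embedding_mods/my_custom_embedding.py:
from .my_custom_embedding import MyCustomEmbedding

def get_embedding_model(config):
    # Hypothetical dispatch: map the embedding_model string from the
    # config to the class that implements it.
    if config["embedding_model"] == "MyCustomModel":
        return MyCustomEmbedding(config)
    # ... keep the existing branches for OpenAI, Cohere, etc. ...
    raise ValueError(f"Unsupported embedding model: {config['embedding_model']}")
With that in place, setting embedding_model: "MyCustomModel" in your configuration routes embedding through your class.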