4. Configuration
VectorETL uses a configuration file to specify the details of the source, embedding model, target database, and other parameters. You can use either YAML or JSON format for the configuration file.
Configuration Format
The configuration file is divided into four main sections:

source: Specifies the data source details
embedding: Defines the embedding model to be used
target: Outlines the target vector database
embed_columns: Defines the columns that need to be embedded (mainly for structured data sources)
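Put together, a minimal configuration skeleton looks like this (the comments indicate what each section holds):

source:
  # where the data comes from
embedding:
  # which embedding model to use
target:
  # which vector database receives the embeddings
embed_columns:
  # which source columns to embed (database sources only)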
Source Configuration
The source section varies based on the source_data_type. Here’s the general structure:
source:
  source_data_type: "type_of_source"
  # Other source-specific parameters
You can view all of the existing source configurations here.
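For instance, a MySQL source might look like the sketch below. This assumes MySQL takes the same connection keys as the Postgres example at the end of this section; the values (like port 3306) are illustrative, so confirm the exact parameters against the source configuration reference.

source:
  source_data_type: "database"
  db_type: "mysql"   # assumed to parallel the "postgres" example below
  host: "localhost"
  database_name: "mydb"
  username: "user"
  password: "password"
  port: 3306         # MySQL's default port
  query: "SELECT * FROM mytable"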
Embedding Configuration
The embedding section specifies which embedding model to use:
embedding:
  embedding_model: "model_name"
  api_key: "your_api_key"
  model_name: "specific_model_name"
You can view all of the existing embedding configurations here.
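Note that embedding_model names the provider, while model_name selects the specific model from that provider. For instance, with OpenAI (matching the full example at the end of this section):

embedding:
  embedding_model: "OpenAI"
  api_key: "your-openai-api-key"
  model_name: "text-embedding-ada-002"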
Target Configuration
The target section varies based on the chosen vector database:
target:
  target_database: "database_name"
  # Database-specific parameters
You can view all of the existing target configurations here.
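For example, a Pinecone target looks like this (drawn from the full example at the end of this section; the cloud and region keys appear in the JSON variant and presumably configure a serverless index):

target:
  target_database: "Pinecone"
  pinecone_api_key: "your-pinecone-api-key"
  index_name: "my-index"
  dimension: 1536     # must match the embedding model's output dimension
  metric: "cosine"
  cloud: "aws"        # from the JSON example; presumably for serverless indexes
  region: "us-west-2"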
Embed Columns Configuration
The embed_columns section requires you to specify the columns from the source data that you want to embed.
NOTE: This is only required for database sources (PostgreSQL, MySQL, Snowflake, Salesforce); support for CSV, JSON, and operational stores is coming soon. If your source is a non-database source, you can just pass an empty array:
embed_columns: # if this is a database source
  - "column1"
  - "column2"
  - "column3"

embed_columns: [] # if this is not a database source
Full Configuration Example
Here’s an example of a complete configuration file:
source:
  source_data_type: "database"
  db_type: "postgres"
  host: "localhost"
  database_name: "mydb"
  username: "user"
  password: "password"
  port: 5432
  query: "SELECT * FROM mytable WHERE updated_at > :last_updated_at"
  batch_size: 1000
  chunk_size: 1000
  chunk_overlap: 0

embedding:
  embedding_model: "OpenAI"
  api_key: "your-openai-api-key"
  model_name: "text-embedding-ada-002"

target:
  target_database: "Pinecone"
  pinecone_api_key: "your-pinecone-api-key"
  index_name: "my-index"
  dimension: 1536
  metric: "cosine"

embed_columns:
  - "column1"
  - "column2"
  - "column3"
While YAML configuration is preferred, you also have the option to pass a JSON file:
{
  "source": {
    "source_data_type": "database",
    "db_type": "postgres",
    "host": "localhost",
    "database_name": "mydb",
    "username": "user",
    "password": "password",
    "port": 5432,
    "query": "SELECT * FROM mytable WHERE updated_at > :last_updated_at",
    "batch_size": 1000,
    "chunk_size": 1000,
    "chunk_overlap": 0
  },
  "embedding": {
    "embedding_model": "OpenAI",
    "api_key": "your-openai-api-key",
    "model_name": "text-embedding-ada-002"
  },
  "target": {
    "target_database": "Pinecone",
    "pinecone_api_key": "your-pinecone-api-key",
    "index_name": "my-index",
    "dimension": 1536,
    "metric": "cosine",
    "cloud": "aws",
    "region": "us-west-2"
  },
  "embed_columns": ["column1", "column2", "column3"]
}