4. Configuration

VectorETL uses a configuration file to specify the details of the source, embedding model, target database, and other parameters. You can use either YAML or JSON format for the configuration file.
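
Once you have a configuration file, running a pipeline is a matter of pointing VectorETL at it. Here is a minimal sketch, assuming the package's create_flow entry point (check the project README for the exact API):

from vector_etl import create_flow

flow = create_flow()                    # build an empty flow
flow.load_yaml("/path/to/config.yaml")  # parse the configuration file
flow.execute()                          # extract, embed, and write to the target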

Configuration Format

The configuration file is divided into four main sections:

  1. source: Specifies the data source details

  2. embedding: Defines the embedding model to be used

  3. target: Outlines the target vector database

  4. embed_columns: Defines the columns that need to be embedded (mainly for structured data sources)
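
Together, these four sections give every configuration file the same overall shape (section bodies elided):

source:
  # where the data comes from
embedding:
  # which model produces the vectors
target:
  # which vector database receives them
embed_columns: []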

Source Configuration

The source section varies based on the source_data_type. Here’s a general structure:

source:
  source_data_type: "type_of_source"
  # Other source-specific parameters

You can view all of the existing source configurations here.
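
For example, a PostgreSQL source (shown again in the full example at the end of this section) fills in the connection details and a query:

source:
  source_data_type: "database"
  db_type: "postgres"
  host: "localhost"
  database_name: "mydb"
  username: "user"
  password: "password"
  port: 5432
  query: "SELECT * FROM mytable WHERE updated_at > :last_updated_at"

The :last_updated_at placeholder suggests incremental extraction, pulling only rows newer than the previous run; see the source configuration reference for its exact semantics.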

Embedding Configuration

The embedding section specifies which embedding model to use:

embedding:
  embedding_model: "model_name"
  api_key: "your_api_key"
  model_name: "specific_model_name"

You can view all of the existing embedding configurations here.
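
Because this section carries an API key, you may not want the real value sitting in the file. One option is to keep a placeholder in the YAML and substitute the key from an environment variable at run time. A sketch, assuming PyYAML plus the create_flow/load_yaml entry points shown earlier (OPENAI_API_KEY is just an illustrative variable name):

import os
import tempfile

import yaml  # PyYAML

from vector_etl import create_flow

# Read the config and inject the key from the environment.
with open("config.yaml") as f:
    config = yaml.safe_load(f)
config["embedding"]["api_key"] = os.environ["OPENAI_API_KEY"]

# Write the completed config to a temporary file and hand it to VectorETL.
with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as tmp:
    yaml.safe_dump(config, tmp)

flow = create_flow()
flow.load_yaml(tmp.name)
flow.execute()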

Target Configuration

The target section varies based on the chosen vector database:

target:
  target_database: "database_name"
  # Database-specific parameters

You can view all of the existing target configurations here.
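
For example, a Pinecone target (as in the full example below) names the index and describes its geometry. Note that dimension must match the output size of the embedding model chosen above; text-embedding-ada-002 produces 1536-dimensional vectors:

target:
  target_database: "Pinecone"
  pinecone_api_key: "your-pinecone-api-key"
  index_name: "my-index"
  dimension: 1536  # must match the embedding model's output size
  metric: "cosine"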

Embed Columns Configuration

The embed_columns section specifies the columns from the source data that you want to embed. NOTE: This is only required for database sources (PostgreSQL, MySQL, Snowflake, Salesforce); support for CSV, JSON, and operational stores is coming soon.

If your source is not a database, just pass an empty array:

embed_columns: # if this is a database source
  - "column1"
  - "column2"
  - "column3"

embed_columns: [] # if this is not a database source

Full Configuration Example

Here’s an example of a complete configuration file:

source:
  source_data_type: "database"
  db_type: "postgres"
  host: "localhost"
  database_name: "mydb"
  username: "user"
  password: "password"
  port: 5432
  query: "SELECT * FROM mytable WHERE updated_at > :last_updated_at"
  batch_size: 1000
  chunk_size: 1000
  chunk_overlap: 0

embedding:
  embedding_model: "OpenAI"
  api_key: "your-openai-api-key"
  model_name: "text-embedding-ada-002"

target:
  target_database: "Pinecone"
  pinecone_api_key: "your-pinecone-api-key"
  index_name: "my-index"
  dimension: 1536
  metric: "cosine"

embed_columns:
  - "column1"
  - "column2"
  - "column3"

While YAML configuration is preferred, you also have the option to pass a JSON file:

{
  "source": {
    "source_data_type": "database",
    "db_type": "postgres",
    "host": "localhost",
    "database_name": "mydb",
    "username": "user",
    "password": "password",
    "port": 5432,
    "query": "SELECT * FROM mytable WHERE updated_at > :last_updated_at",
    "batch_size": 1000,
    "chunk_size": 1000,
    "chunk_overlap": 0
  },
  "embedding": {
    "embedding_model": "OpenAI",
    "api_key": "your-openai-api-key",
    "model_name": "text-embedding-ada-002"
  },
  "target": {
    "target_database": "Pinecone",
    "pinecone_api_key": "your-pinecone-api-key",
    "index_name": "my-index",
    "dimension": 1536,
    "metric": "cosine",
    "cloud": "aws",
    "region": "us-west-2"
  },
  "embed_columns": ["column1", "column2", "column3"]
}
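
Because both formats carry the same structure, converting between them is mechanical. Here is a small sketch (assuming PyYAML; config.json and config.yaml are illustrative file names) that loads a JSON config, checks for the four required top-level sections, and writes it back out as YAML:

import json

import yaml  # PyYAML

REQUIRED_SECTIONS = ("source", "embedding", "target", "embed_columns")

with open("config.json") as f:
    config = json.load(f)

# Fail early if a top-level section is missing.
missing = [s for s in REQUIRED_SECTIONS if s not in config]
if missing:
    raise ValueError(f"config is missing sections: {missing}")

with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)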