4. Configuration

VectorETL uses a configuration file to specify the details of the source, embedding model, target database, and other parameters. You can use either YAML or JSON format for the configuration file.
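
Once you have a configuration file, running a pipeline is a matter of pointing VectorETL at it. Here is a minimal sketch, assuming the package's create_flow entry point (check the project README for the exact API):

from vector_etl import create_flow

flow = create_flow()                    # build an empty flow
flow.load_yaml("/path/to/config.yaml")  # parse the configuration file
flow.execute()                          # extract, embed, and write to the target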

Configuration Format

The configuration file is divided into four main sections:

  1. source: Specifies the data source details

  2. embedding: Defines the embedding model to be used

  3. target: Outlines the target vector database

  4. embed_columns: Defines the columns that need to be embedded (mainly for structured data sources)
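
Together, these four sections give every configuration file the same overall shape (section bodies elided):

source:
  # where the data comes from
embedding:
  # which model produces the vectors
target:
  # which vector database receives them
embed_columns: []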

Source Configuration

The source section varies based on the source_data_type. Here’s a general structure:

source:
  source_data_type: "type_of_source"
  # Other source-specific parameters

You can view all of the existing source configurations here.
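
For example, a PostgreSQL source (shown again in the full example at the end of this section) fills in the connection details and a query:

source:
  source_data_type: "database"
  db_type: "postgres"
  host: "localhost"
  database_name: "mydb"
  username: "user"
  password: "password"
  port: 5432
  query: "SELECT * FROM mytable WHERE updated_at > :last_updated_at"

The :last_updated_at placeholder suggests incremental extraction, pulling only rows newer than the previous run; see the source configuration reference for its exact semantics.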

Embedding Configuration

The embedding section specifies which embedding model to use:

embedding:
  embedding_model: "model_name"
  api_key: "your_api_key"
  model_name: "specific_model_name"

You can view all of the existing embedding configurations here.
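
Because this section carries an API key, you may not want the real value sitting in the file. One option is to keep a placeholder in the YAML and substitute the key from an environment variable at run time. A sketch, assuming PyYAML plus the create_flow/load_yaml entry points shown earlier (OPENAI_API_KEY is just an illustrative variable name):

import os
import tempfile

import yaml  # PyYAML

from vector_etl import create_flow

# Read the config and inject the key from the environment.
with open("config.yaml") as f:
    config = yaml.safe_load(f)
config["embedding"]["api_key"] = os.environ["OPENAI_API_KEY"]

# Write the completed config to a temporary file and hand it to VectorETL.
with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as tmp:
    yaml.safe_dump(config, tmp)

flow = create_flow()
flow.load_yaml(tmp.name)
flow.execute()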

Target Configuration

The target section varies based on the chosen vector database:

target:
  target_database: "database_name"
  # Database-specific parameters

You can view all of the existing target configurations here.
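
For example, a Pinecone target (as in the full example below) names the index and describes its geometry. Note that dimension must match the output size of the embedding model chosen above; text-embedding-ada-002 produces 1536-dimensional vectors:

target:
  target_database: "Pinecone"
  pinecone_api_key: "your-pinecone-api-key"
  index_name: "my-index"
  dimension: 1536  # must match the embedding model's output size
  metric: "cosine"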

Embed Columns Configuration

The embed_columns section specifies the columns from the source data that you want to embed. NOTE: This is only required for database sources (PostgreSQL, MySQL, Snowflake, Salesforce); support for CSV, JSON, and operational stores is coming soon.

If your source is not a database, just pass an empty array:

embed_columns: # if this is a database source
  - "column1"
  - "column2"
  - "column3"

embed_columns: [] # if this is not a database source

Full Configuration Example

Here’s an example of a complete configuration file:

source:
  source_data_type: "database"
  db_type: "postgres"
  host: "localhost"
  database_name: "mydb"
  username: "user"
  password: "password"
  port: 5432
  query: "SELECT * FROM mytable WHERE updated_at > :last_updated_at"
  batch_size: 1000
  chunk_size: 1000
  chunk_overlap: 0

embedding:
  embedding_model: "OpenAI"
  api_key: "your-openai-api-key"
  model_name: "text-embedding-ada-002"

target:
  target_database: "Pinecone"
  pinecone_api_key: "your-pinecone-api-key"
  index_name: "my-index"
  dimension: 1536
  metric: "cosine"

embed_columns:
  - "column1"
  - "column2"
  - "column3"

While YAML configuration is preferred, you also have the option to pass a JSON file:

{
  "source": {
    "source_data_type": "database",
    "db_type": "postgres",
    "host": "localhost",
    "database_name": "mydb",
    "username": "user",
    "password": "password",
    "port": 5432,
    "query": "SELECT * FROM mytable WHERE updated_at > :last_updated_at",
    "batch_size": 1000,
    "chunk_size": 1000,
    "chunk_overlap": 0
  },
  "embedding": {
    "embedding_model": "OpenAI",
    "api_key": "your-openai-api-key",
    "model_name": "text-embedding-ada-002"
  },
  "target": {
    "target_database": "Pinecone",
    "pinecone_api_key": "your-pinecone-api-key",
    "index_name": "my-index",
    "dimension": 1536,
    "metric": "cosine",
    "cloud": "aws",
    "region": "us-west-2"
  },
  "embed_columns": ["column1", "column2", "column3"]
}
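
Because both formats carry the same structure, converting between them is mechanical. Here is a small sketch (assuming PyYAML; config.json and config.yaml are illustrative file names) that loads a JSON config, checks for the four required top-level sections, and writes it back out as YAML:

import json

import yaml  # PyYAML

REQUIRED_SECTIONS = ("source", "embedding", "target", "embed_columns")

with open("config.json") as f:
    config = json.load(f)

# Fail early if a top-level section is missing.
missing = [s for s in REQUIRED_SECTIONS if s not in config]
if missing:
    raise ValueError(f"config is missing sections: {missing}")

with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)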