## 4. Configuration

VectorETL uses a configuration file to specify the details of the source, embedding model, target database, and other parameters. You can use either YAML or JSON format for the configuration file.

### Configuration Format

The configuration file is divided into four main sections:

1. `source`: Specifies the data source details
2. `embedding`: Defines the embedding model to be used
3. `target`: Outlines the target vector database
4. `embed_columns`: Defines the columns that need to be embedded (mainly for structured data sources)

### Source Configuration

The `source` section varies based on the `source_data_type`. Here's the general structure:

```yaml
source:
  source_data_type: "type_of_source"
  # Other source-specific parameters
```

You can view all of the existing source configurations [here](data_sources).

### Embedding Configuration

The `embedding` section specifies which embedding model to use:

```yaml
embedding:
  embedding_model: "model_name"
  api_key: "your_api_key"
  model_name: "specific_model_name"
```

You can view all of the existing embedding configurations [here](embedding_models).

### Target Configuration

The `target` section varies based on the chosen vector database:

```yaml
target:
  target_database: "database_name"
  # Database-specific parameters
```

You can view all of the existing target configurations [here](vector_dbs).

### Embed Columns Configuration

The `embed_columns` section lists the columns from the source data that you want to embed.

NOTE: This is only required for database sources (PostgreSQL, MySQL, Snowflake, Salesforce). Support for CSV, JSON, and operational stores is coming soon.

If your source is a non-database source, just pass an empty array:

```yaml
embed_columns: # if this is a database source
  - "column1"
  - "column2"
  - "column3"

embed_columns: [] # if this is a non-database source
```

### Full Configuration Example

Here's an example of a complete configuration file:

```yaml
source:
  source_data_type: "database"
  db_type: "postgres"
  host: "localhost"
  database_name: "mydb"
  username: "user"
  password: "password"
  port: 5432
  query: "SELECT * FROM mytable WHERE updated_at > :last_updated_at"
  batch_size: 1000
  chunk_size: 1000
  chunk_overlap: 0

embedding:
  embedding_model: "OpenAI"
  api_key: "your-openai-api-key"
  model_name: "text-embedding-ada-002"

target:
  target_database: "Pinecone"
  pinecone_api_key: "your-pinecone-api-key"
  index_name: "my-index"
  dimension: 1536
  metric: "cosine"
  cloud: "aws"
  region: "us-west-2"

embed_columns:
  - "column1"
  - "column2"
  - "column3"
```

While YAML configuration is preferred, you also have the option to pass a JSON file:

```json
{
  "source": {
    "source_data_type": "database",
    "db_type": "postgres",
    "host": "localhost",
    "database_name": "mydb",
    "username": "user",
    "password": "password",
    "port": 5432,
    "query": "SELECT * FROM mytable WHERE updated_at > :last_updated_at",
    "batch_size": 1000,
    "chunk_size": 1000,
    "chunk_overlap": 0
  },
  "embedding": {
    "embedding_model": "OpenAI",
    "api_key": "your-openai-api-key",
    "model_name": "text-embedding-ada-002"
  },
  "target": {
    "target_database": "Pinecone",
    "pinecone_api_key": "your-pinecone-api-key",
    "index_name": "my-index",
    "dimension": 1536,
    "metric": "cosine",
    "cloud": "aws",
    "region": "us-west-2"
  },
  "embed_columns": ["column1", "column2", "column3"]
}
```
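Because the same four-section structure applies to both formats, it's straightforward to sanity-check a config before handing it to the pipeline. The sketch below is not part of VectorETL itself; `load_config` and `validate_config` are hypothetical helpers that only encode the rules described above (four required sections, and the `embed_columns` requirement for database sources):

```python
import json
import yaml  # pip install pyyaml

REQUIRED_SECTIONS = ("source", "embedding", "target", "embed_columns")


def load_config(path):
    """Load a VectorETL config from a YAML or JSON file."""
    with open(path) as f:
        if path.endswith(".json"):
            return json.load(f)
        return yaml.safe_load(f)


def validate_config(config):
    """Check the four top-level sections and the embed_columns rule."""
    for section in REQUIRED_SECTIONS:
        if section not in config:
            raise ValueError(f"Missing required section: {section!r}")

    # embed_columns is required for database sources and should be
    # an empty array for everything else.
    is_db_source = config["source"].get("source_data_type") == "database"
    if is_db_source and not config["embed_columns"]:
        raise ValueError("Database sources must list at least one embed column")
    if not is_db_source and config["embed_columns"]:
        raise ValueError("Non-database sources should pass an empty embed_columns array")
    return config


config = validate_config(load_config("config.yaml"))
print("Config OK:", config["source"]["source_data_type"],
      "->", config["target"]["target_database"])
```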
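Once the file validates, it is consumed by the flow at runtime. The quick-start pattern looks like the following; treat the exact function names as assumptions and check the project README if your installed version differs:

```python
from vector_etl import create_flow  # assumes the vector-etl package is installed

flow = create_flow()
flow.load_yaml("config.yaml")  # point this at your YAML (or JSON) config file
flow.execute()
```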