## 4. Configuration

VectorETL uses a configuration file to specify the details of the source, embedding model, target database, and other parameters. You can use either YAML or JSON format for the configuration file.

### Configuration Format

The configuration file is divided into four main sections:

1. `source`: Specifies the data source details
2. `embedding`: Defines the embedding model to be used
3. `target`: Outlines the target vector database
4. `embed_columns`: Defines the columns that need to be embedded (mainly for structured data sources)

### Source Configuration

The `source` section varies based on the `source_data_type`. Here's the general structure:

```yaml
source:
  source_data_type: "type_of_source"
  # Other source-specific parameters
```

You can view all of the existing source configurations [here](data_sources).

### Embedding Configuration

The `embedding` section specifies which embedding model to use:

```yaml
embedding:
  embedding_model: "model_name"
  api_key: "your_api_key"
  model_name: "specific_model_name"
```

You can view all of the existing embedding configurations [here](embedding_models).

### Target Configuration

The `target` section varies based on the chosen vector database:

```yaml
target:
  target_database: "database_name"
  # Database-specific parameters
```

You can view all of the existing target configurations [here](vector_dbs).

### Embed Columns Configuration

The `embed_columns` section lists the columns from the source data that you want to embed.

NOTE: This is only required for database sources (PostgreSQL, MySQL, Snowflake, Salesforce). Support for CSV, JSON, and operational stores is coming soon.

If your source is a non-database source, just pass an empty array:

```yaml
embed_columns: # if this is a database source
  - "column1"
  - "column2"
  - "column3"

embed_columns: [] # if this is a non-database source
```

### Full Configuration Example

Here's an example of a complete configuration file:

```yaml
source:
  source_data_type: "database"
  db_type: "postgres"
  host: "localhost"
  database_name: "mydb"
  username: "user"
  password: "password"
  port: 5432
  query: "SELECT * FROM mytable WHERE updated_at > :last_updated_at"
  batch_size: 1000
  chunk_size: 1000
  chunk_overlap: 0

embedding:
  embedding_model: "OpenAI"
  api_key: "your-openai-api-key"
  model_name: "text-embedding-ada-002"

target:
  target_database: "Pinecone"
  pinecone_api_key: "your-pinecone-api-key"
  index_name: "my-index"
  dimension: 1536
  metric: "cosine"
  cloud: "aws"
  region: "us-west-2"

embed_columns:
  - "column1"
  - "column2"
  - "column3"
```

While YAML configuration is preferred, you also have the option to pass a JSON file:

```json
{
  "source": {
    "source_data_type": "database",
    "db_type": "postgres",
    "host": "localhost",
    "database_name": "mydb",
    "username": "user",
    "password": "password",
    "port": 5432,
    "query": "SELECT * FROM mytable WHERE updated_at > :last_updated_at",
    "batch_size": 1000,
    "chunk_size": 1000,
    "chunk_overlap": 0
  },
  "embedding": {
    "embedding_model": "OpenAI",
    "api_key": "your-openai-api-key",
    "model_name": "text-embedding-ada-002"
  },
  "target": {
    "target_database": "Pinecone",
    "pinecone_api_key": "your-pinecone-api-key",
    "index_name": "my-index",
    "dimension": 1536,
    "metric": "cosine",
    "cloud": "aws",
    "region": "us-west-2"
  },
  "embed_columns": ["column1", "column2", "column3"]
}
```
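Because the same four-section structure applies to both formats, it's straightforward to sanity-check a config before handing it to the pipeline. The sketch below is not part of VectorETL itself; `load_config` and `validate_config` are hypothetical helpers that only encode the rules described above (four required sections, and the `embed_columns` requirement for database sources):

```python
import json
import yaml  # pip install pyyaml

REQUIRED_SECTIONS = ("source", "embedding", "target", "embed_columns")


def load_config(path):
    """Load a VectorETL config from a YAML or JSON file."""
    with open(path) as f:
        if path.endswith(".json"):
            return json.load(f)
        return yaml.safe_load(f)


def validate_config(config):
    """Check the four top-level sections and the embed_columns rule."""
    for section in REQUIRED_SECTIONS:
        if section not in config:
            raise ValueError(f"Missing required section: {section!r}")

    # embed_columns is required for database sources and should be
    # an empty array for everything else.
    is_db_source = config["source"].get("source_data_type") == "database"
    if is_db_source and not config["embed_columns"]:
        raise ValueError("Database sources must list at least one embed column")
    if not is_db_source and config["embed_columns"]:
        raise ValueError("Non-database sources should pass an empty embed_columns array")
    return config


config = validate_config(load_config("config.yaml"))
print("Config OK:", config["source"]["source_data_type"],
      "->", config["target"]["target_database"])
```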
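Once the file validates, it is consumed by the flow at runtime. The quick-start pattern looks like the following; treat the exact function names as assumptions and check the project README if your installed version differs:

```python
from vector_etl import create_flow  # assumes the vector-etl package is installed

flow = create_flow()
flow.load_yaml("config.yaml")  # point this at your YAML (or JSON) config file
flow.execute()
```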