2. Getting Started

This section will guide you through the process of installing VectorETL and running your first ETL pipeline.

Installation Guide
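VectorETL can be installed with pip. Assuming the package is published under the name vector-etl, installation is a single command:

```shell
# Install VectorETL from PyPI (package name assumed to be vector-etl)
pip install vector-etl
```

It is good practice to run this inside a virtual environment so the dependencies stay isolated from your system Python.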

Quick Start Example

Here are a few options to get you started with VectorETL:

Option 1: Import VectorETL into your existing Python application

Within your existing RAG or vector search application, you can include a snippet similar to the code below:

from vector_etl import create_flow

# Source: read rows from a Postgres table in batches
source = {
    "source_data_type": "database",
    "db_type": "postgres",
    "host": "localhost",
    "port": "5432",
    "database_name": "test",
    "username": "user",
    "password": "password",
    "query": "select * from test",
    "batch_size": 1000,
    "chunk_size": 1000,
    "chunk_overlap": 0,
}

# Embedding: model used to convert text into vectors
embedding = {
    "embedding_model": "OpenAI",
    "api_key": "my-openai-key",
    "model_name": "text-embedding-ada-002"
}

# Target: vector database where the embeddings are stored
target = {
    "target_database": "Pinecone",
    "pinecone_api_key": "my-pinecone-key",
    "index_name": "my-pinecone-index",
    "dimension": 1536
}

# Columns whose contents will be embedded and written as metadata
embed_columns = ["customer_name", "customer_description", "purchase_history"]

flow = create_flow()
flow.set_source(source)
flow.set_embedding(embedding)
flow.set_target(target)
flow.set_embed_columns(embed_columns)

# Execute the flow
flow.execute()

Option 2: Import VectorETL into your Python application (using a YAML configuration file)

You can load the configuration file in your Python project and run the pipeline from there.

  1. Create a configuration file as shown below:

source:
  source_data_type: "Local File"
  file_path: "/path/to/your/data.csv"
  file_type: "csv"

embedding:
  embedding_model: "OpenAI"
  api_key: "your-openai-api-key"
  model_name: "text-embedding-ada-002"

target:
  target_database: "Pinecone"
  pinecone_api_key: "your-pinecone-api-key"
  index_name: "my-vector-index"

embed_columns:
  - "text_column1"
  - "text_column2"

  2. Load the configuration file in your Python application:

from vector_etl import create_flow

flow = create_flow()
flow.load_yaml('/path/to/your/config.yaml')
flow.execute()

Option 3: Running from the command line using a configuration file

  1. Use the same configuration file as shown in Option 2:

source:
  source_data_type: "Local File"
  file_path: "/path/to/your/data.csv"
  file_type: "csv"

embedding:
  embedding_model: "OpenAI"
  api_key: "your-openai-api-key"
  model_name: "text-embedding-ada-002"

target:
  target_database: "Pinecone"
  pinecone_api_key: "your-pinecone-api-key"
  index_name: "my-vector-index"

embed_columns:
  - "text_column1"
  - "text_column2"

  2. Run the ETL process:

vector-etl -c /path/to/your/config.yaml

This will process the CSV file, create embeddings using OpenAI’s model, and store them in a Pinecone index.

Basic Configuration

The configuration file is divided into four main sections:

  1. source: Specifies the data source details

  2. embedding: Defines the embedding model to be used

  3. target: Outlines the target vector database

  4. embed_columns: Specifies the columns/fields that will be embedded and written as metadata to the vector database

Adjust these sections according to your specific data source, preferred embedding model, and target vector database.
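Before running a pipeline, it can help to confirm that all four sections are present in your configuration. The sketch below loads an inline copy of the example configuration with PyYAML (assumed to be installed; it is not part of the standard library) and checks for the required top-level keys:

```python
import yaml

# Inline copy of the example configuration from this guide
config_text = """
source:
  source_data_type: "Local File"
  file_path: "/path/to/your/data.csv"
  file_type: "csv"
embedding:
  embedding_model: "OpenAI"
  api_key: "your-openai-api-key"
  model_name: "text-embedding-ada-002"
target:
  target_database: "Pinecone"
  pinecone_api_key: "your-pinecone-api-key"
  index_name: "my-vector-index"
embed_columns:
  - "text_column1"
  - "text_column2"
"""

config = yaml.safe_load(config_text)

# Verify the four required sections exist before executing the flow
required = ["source", "embedding", "target", "embed_columns"]
missing = [key for key in required if key not in config]
print("missing sections:", missing)  # → missing sections: []
```

In practice you would read the file from disk (for example with `yaml.safe_load(open(path))`) rather than embedding it as a string; the check itself stays the same.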