2. Getting Started
This section guides you through installing VectorETL and running your first ETL pipeline.
Installation Guide
Using pip (recommended for most users)
To install VectorETL using pip, run the following command:
pip install vector-etl
or, to install the latest development version directly from GitHub:
pip install git+https://github.com/ContextData/VectorETL.git
Quick Start Example
Here are a few options to get you started with VectorETL:
Option 1: Import VectorETL into your existing Python application
Within your existing RAG or vector search application, include a file or snippet similar to the code below:
from vector_etl import create_flow

source = {
    "source_data_type": "database",
    "db_type": "postgres",
    "host": "localhost",
    "port": "5432",
    "database_name": "test",
    "username": "user",
    "password": "password",
    "query": "select * from test",
    "batch_size": 1000,
    "chunk_size": 1000,
    "chunk_overlap": 0,
}

embedding = {
    "embedding_model": "OpenAI",
    "api_key": "my-openai-key",
    "model_name": "text-embedding-ada-002"
}

target = {
    "target_database": "Pinecone",
    "pinecone_api_key": "my-pinecone-key",
    "index_name": "my-pinecone-index",
    "dimension": 1536
}

embed_columns = ["customer_name", "customer_description", "purchase_history"]

flow = create_flow()
flow.set_source(source)
flow.set_embedding(embedding)
flow.set_target(target)
flow.set_embed_columns(embed_columns)

# Execute the flow
flow.execute()
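Hardcoding credentials in source files is easy to leak; a common alternative is to read them from environment variables. A minimal sketch, assuming the variable names OPENAI_API_KEY and PINECONE_API_KEY (these names are a convention chosen here, not a VectorETL requirement):

```python
import os

# Read credentials from the environment instead of hardcoding them.
# The variable names below are conventional choices, not VectorETL requirements.
embedding = {
    "embedding_model": "OpenAI",
    "api_key": os.environ.get("OPENAI_API_KEY", ""),
    "model_name": "text-embedding-ada-002",
}

target = {
    "target_database": "Pinecone",
    "pinecone_api_key": os.environ.get("PINECONE_API_KEY", ""),
    "index_name": "my-pinecone-index",
    "dimension": 1536,
}
```

These dictionaries can then be passed to flow.set_embedding() and flow.set_target() exactly as in the example above.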
Option 2: Import VectorETL into your Python application (using a YAML configuration file)
You can load the configuration file into your Python project and run the pipeline from there.
Create a configuration file as shown below:
source:
  source_data_type: "Local File"
  file_path: "/path/to/your/data.csv"
  file_type: "csv"

embedding:
  embedding_model: "OpenAI"
  api_key: "your-openai-api-key"
  model_name: "text-embedding-ada-002"

target:
  target_database: "Pinecone"
  pinecone_api_key: "your-pinecone-api-key"
  index_name: "my-vector-index"

embed_columns:
  - "text_column1"
  - "text_column2"
Load the configuration file from your Python application:
from vector_etl import create_flow
flow = create_flow()
flow.load_yaml('/path/to/your/config.yaml')
flow.execute()
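If you prefer to keep everything in Python, the configuration can also be written to disk from code before being loaded. A minimal sketch using only the standard library (the file name config.yaml is illustrative):

```python
from pathlib import Path

# Write the Option 2 configuration to disk from Python.
# A convenience sketch; any text editor works just as well.
config_text = """\
source:
  source_data_type: "Local File"
  file_path: "/path/to/your/data.csv"
  file_type: "csv"
embedding:
  embedding_model: "OpenAI"
  api_key: "your-openai-api-key"
  model_name: "text-embedding-ada-002"
target:
  target_database: "Pinecone"
  pinecone_api_key: "your-pinecone-api-key"
  index_name: "my-vector-index"
embed_columns:
  - "text_column1"
  - "text_column2"
"""

config_path = Path("config.yaml")
config_path.write_text(config_text)
```

The resulting path can then be passed to flow.load_yaml() as shown above.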
Option 3: Running from the command line using a configuration file
Using the same configuration file as in Option 2:
source:
  source_data_type: "Local File"
  file_path: "/path/to/your/data.csv"
  file_type: "csv"

embedding:
  embedding_model: "OpenAI"
  api_key: "your-openai-api-key"
  model_name: "text-embedding-ada-002"

target:
  target_database: "Pinecone"
  pinecone_api_key: "your-pinecone-api-key"
  index_name: "my-vector-index"

embed_columns:
  - "text_column1"
  - "text_column2"
Run the ETL process:
vector-etl -c /path/to/your/config.yaml
This will process the CSV file, create embeddings using OpenAI’s model, and store them in a Pinecone index.
Basic Configuration
The configuration file is divided into four main sections:
source: Specifies the data source details
embedding: Defines the embedding model to be used
target: Outlines the target vector database
embed_columns: Specifies the columns/fields that will be embedded and written as metadata to the vector database
Adjust these sections according to your specific data source, preferred embedding model, and target vector database.
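Before handing a configuration to a flow, it can help to check that all four sections are present. A minimal sketch; the validate_config helper is hypothetical and not part of VectorETL:

```python
REQUIRED_SECTIONS = ("source", "embedding", "target", "embed_columns")

def validate_config(config: dict) -> list:
    """Return the names of required sections missing from a config dict."""
    return [name for name in REQUIRED_SECTIONS if name not in config]

# Example: this config is missing the embed_columns section.
config = {
    "source": {"source_data_type": "Local File", "file_type": "csv"},
    "embedding": {"embedding_model": "OpenAI", "api_key": "your-openai-api-key"},
    "target": {"target_database": "Pinecone", "index_name": "my-vector-index"},
}
missing = validate_config(config)
```

Running the check before flow.execute() turns a missing section into an explicit error rather than a failure midway through the pipeline.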