## 2. Getting Started

This section guides you through installing VectorETL and running your first ETL pipeline.

### Installation Guide

#### Using pip (recommended for most users)

To install VectorETL using pip, run the following command:

```bash
pip install vector-etl
```

or, to install directly from the GitHub repository:

```bash
pip install git+https://github.com/ContextData/VectorETL.git
```

#### Quick Start Example

Here are a few options to get you started with VectorETL:

##### Option 1: Import VectorETL into your existing Python application

Within your current RAG or vector search application, you can include a file or snippet similar to the code below:

```python
from vector_etl import create_flow

# Where the data comes from: a Postgres query, read in batches and chunked
source = {
    "source_data_type": "database",
    "db_type": "postgres",
    "host": "localhost",
    "port": "5432",
    "database_name": "test",
    "username": "user",
    "password": "password",
    "query": "select * from test",
    "batch_size": 1000,
    "chunk_size": 1000,
    "chunk_overlap": 0,
}

# Which embedding model to use
embedding = {
    "embedding_model": "OpenAI",
    "api_key": "my-openai-key",
    "model_name": "text-embedding-ada-002",
}

# Which vector database receives the embeddings
target = {
    "target_database": "Pinecone",
    "pinecone_api_key": "my-pinecone-key",
    "index_name": "my-pinecone-index",
    "dimension": 1536,
}

# Columns that will be embedded
embed_columns = ["customer_name", "customer_description", "purchase_history"]

flow = create_flow()
flow.set_source(source)
flow.set_embedding(embedding)
flow.set_target(target)
flow.set_embed_columns(embed_columns)

# Execute the flow
flow.execute()
```

##### Option 2: Import VectorETL into your Python application (using a YAML configuration file)

You can load a configuration file into your Python project and run the pipeline from there.

1. Create a configuration file as shown below:

   ```yaml
   source:
     source_data_type: "Local File"
     file_path: "/path/to/your/data.csv"
     file_type: "csv"

   embedding:
     embedding_model: "OpenAI"
     api_key: "your-openai-api-key"
     model_name: "text-embedding-ada-002"

   target:
     target_database: "Pinecone"
     pinecone_api_key: "your-pinecone-api-key"
     index_name: "my-vector-index"

   embed_columns:
     - "text_column1"
     - "text_column2"
   ```

2. Import the configuration file into your Python application:

   ```python
   from vector_etl import create_flow

   flow = create_flow()
   flow.load_yaml('/path/to/your/config.yaml')
   flow.execute()
   ```

##### Option 3: Running from the command line using a configuration file

1. Use the same configuration file as shown in Option 2:

   ```yaml
   source:
     source_data_type: "Local File"
     file_path: "/path/to/your/data.csv"
     file_type: "csv"

   embedding:
     embedding_model: "OpenAI"
     api_key: "your-openai-api-key"
     model_name: "text-embedding-ada-002"

   target:
     target_database: "Pinecone"
     pinecone_api_key: "your-pinecone-api-key"
     index_name: "my-vector-index"

   embed_columns:
     - "text_column1"
     - "text_column2"
   ```

2. Run the ETL process:

   ```bash
   vector-etl -c /path/to/your/config.yaml
   ```

This will process the CSV file, create embeddings using OpenAI's model, and store them in a Pinecone index.

### Basic Configuration

The configuration file is divided into four main sections:

1. `source`: Specifies the data source details
2. `embedding`: Defines the embedding model to be used
3. `target`: Outlines the target vector database
4. `embed_columns`: Specifies the columns/fields that will be embedded and written as metadata to the vector database

Adjust these sections according to your specific data source, preferred embedding model, and target vector database.
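
As a minimal sketch of the four-section layout described above, the helper below checks that a configuration dictionary contains each top-level section before it is handed to VectorETL. The function `missing_sections` and the constant `REQUIRED_SECTIONS` are hypothetical names introduced for illustration; they are not part of the VectorETL API. The keys and values mirror the configuration examples in this section.

```python
# Hypothetical pre-flight check; not part of the VectorETL API.
REQUIRED_SECTIONS = ("source", "embedding", "target", "embed_columns")


def missing_sections(config):
    """Return the required top-level sections that are absent or empty."""
    return [name for name in REQUIRED_SECTIONS if not config.get(name)]


# Same structure as the YAML configuration shown in this section.
config = {
    "source": {
        "source_data_type": "Local File",
        "file_path": "/path/to/your/data.csv",
        "file_type": "csv",
    },
    "embedding": {
        "embedding_model": "OpenAI",
        "api_key": "your-openai-api-key",
        "model_name": "text-embedding-ada-002",
    },
    "target": {
        "target_database": "Pinecone",
        "pinecone_api_key": "your-pinecone-api-key",
        "index_name": "my-vector-index",
    },
    "embed_columns": ["text_column1", "text_column2"],
}

problems = missing_sections(config)
if problems:
    raise ValueError(f"Configuration is missing sections: {', '.join(problems)}")
print("Configuration has all four sections.")
```

If the configuration lives in a YAML file, the same check could be applied to the dictionary returned by a YAML parser (for example PyYAML's `yaml.safe_load`, assuming that package is installed) before running the pipeline.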