3. Core Concepts

Understanding the core concepts of VectorETL will help you make the most of the framework.

ETL Process Overview

VectorETL follows the standard Extract, Transform, Load process:

  1. Extract: Data is retrieved from various sources (e.g., databases, file systems, APIs).

  2. Transform: The extracted data is converted into vector embeddings using specified models.

  3. Load: The resulting vectors are stored in a chosen vector database.
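The three stages above can be sketched as a minimal Python pipeline. All function names and the toy embedding logic here are illustrative only, not VectorETL's actual API:

```python
# Illustrative Extract-Transform-Load flow; not VectorETL's real interfaces.

def extract(rows):
    """Extract: pull raw records from a source (here, an in-memory list)."""
    return list(rows)

def transform(records):
    """Transform: map each record to a vector. A real pipeline would call an
    embedding model; this stub just encodes the first four characters."""
    return [[float(ord(c)) for c in text[:4]] for text in records]

def load(vectors, store):
    """Load: write the vectors into a target store (here, a plain dict)."""
    for i, vector in enumerate(vectors):
        store[i] = vector
    return store

store = load(transform(extract(["alpha", "beta"])), {})
```

Each stage only depends on the output of the previous one, which is what lets VectorETL swap sources, embedding models, and targets independently.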

Vector Embeddings

Vector embeddings are numerical representations of data (often text) in a high-dimensional space. These embeddings capture semantic meaning, allowing for efficient similarity comparisons and search operations.

In VectorETL, the embedding process converts your input data into these vector representations, which can then be used for various machine learning and information retrieval tasks.
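The similarity comparisons mentioned above are typically done with a metric such as cosine similarity. A small sketch, using toy 3-dimensional vectors (real embedding models produce hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction,
    0.0 = orthogonal (unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-made toy "embeddings": semantically close items get nearby vectors.
cat = [0.9, 0.1, 0.2]
kitten = [0.85, 0.15, 0.25]
car = [0.1, 0.9, 0.3]

# "cat" is closer to "kitten" than to "car" in this space.
assert cosine_similarity(cat, kitten) > cosine_similarity(cat, car)
```

Vector databases use exactly this kind of metric (cosine, dot product, or Euclidean distance) to answer nearest-neighbor queries over the vectors a pipeline loads.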

Pipeline Architecture

VectorETL’s pipeline architecture consists of three main components:

  1. Source Modules: Handle data extraction from various sources.

    • Examples: S3Source, DatabaseSource, LocalFileSource

  2. Embedding Modules: Transform extracted data into vector embeddings.

    • Examples: OpenAIEmbedding, CohereEmbedding, GoogleGeminiEmbedding

  3. Target Modules: Manage the loading of vector embeddings into vector databases.

    • Examples: PineconeTarget, QdrantTarget, WeaviateTarget

The Orchestrator coordinates these components, managing the flow of data from source through embedding to target based on the provided configuration.
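The source/embedding/target split can be sketched as three small interfaces wired together by an orchestrator. The class and method names below are hypothetical stand-ins for illustration, not VectorETL's actual classes:

```python
# Hypothetical sketch of the modular pipeline pattern described above.
from abc import ABC, abstractmethod

class Source(ABC):
    @abstractmethod
    def fetch(self):                 # returns raw records
        ...

class Embedding(ABC):
    @abstractmethod
    def embed(self, records):        # returns one vector per record
        ...

class Target(ABC):
    @abstractmethod
    def write(self, vectors):        # persists vectors in a vector store
        ...

# Minimal concrete implementations for demonstration.
class ListSource(Source):
    def __init__(self, rows):
        self.rows = rows
    def fetch(self):
        return self.rows

class StubEmbedding(Embedding):
    def embed(self, records):
        return [[float(len(r))] for r in records]

class MemoryTarget(Target):
    def __init__(self):
        self.stored = []
    def write(self, vectors):
        self.stored.extend(vectors)

class Orchestrator:
    """Wires source -> embedding -> target, as the configuration dictates."""
    def __init__(self, source, embedding, target):
        self.source, self.embedding, self.target = source, embedding, target
    def run(self):
        self.target.write(self.embedding.embed(self.source.fetch()))

orchestrator = Orchestrator(ListSource(["a", "bb"]), StubEmbedding(), MemoryTarget())
orchestrator.run()
```

Because each component only depends on the abstract interface, adding a new source or target means implementing one class, without touching the rest of the pipeline.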

This modular design allows for easy extensibility and customization of the ETL process to suit your specific needs.