## 5. Data Sources VectorETL supports a wide range of data sources, allowing you to extract data from various systems and formats. ### Overview of Supported Sources - Amazon S3 - Box - Local files - Databases (PostgreSQL, MySQL, Snowflake, Salesforce) - Dropbox - Stripe - Zendesk - Google Drive - Google Cloud Storage ### Amazon S3 **Python variable (as JSON)** ```json { "source_data_type": "Amazon S3", "bucket_name": "myBucket", "prefix": "Dir/Subdir/", "file_type": "csv", "aws_access_key_id": "your-access-key", "aws_secret_access_key": "your-secret-access-key", "chunk_size": 1000, "chunk_overlap": 0 } ``` **YAML** ```yaml source: source_data_type: "Amazon S3" bucket_name: "myBucket" prefix: "Dir/Subdir/" file_type: "csv" aws_access_key_id: "your-access-key" aws_secret_access_key: "your-secret-access-key" chunk_size: 1000 chunk_overlap: 0 ``` ### Box **Python variable (as JSON)** ```json { "source_data_type": "Box", "folder_path": "MyFolder", "file_type": "pdf", "access_token": "your-developer-token", "chunk_size": 1000, "chunk_overlap": 200 } ``` **YAML** ```yaml source: source_data_type: "Box" folder_path: "MyFolder" file_type: "pdf" access_token: "your-developer-token" chunk_size: 1000 chunk_overlap: 200 ``` ### Local Files **Python variable (as JSON)** ```json { "source_data_type": "Local File", "file_path": "/path/to/your/data/", "file_type": "csv", "chunk_size": 1000, "chunk_overlap": 0 } ``` **YAML** ```yaml source: source_data_type: "Local File" file_path: "/path/to/your/data/" file_type: "csv" chunk_size: 1000 chunk_overlap: 0 ``` ### Database (PostgreSQL example) **Python variable (as JSON)** ```json { "source_data_type": "database", "db_type": "postgres", "host": "localhost", "database_name": "mydb", "username": "user", "password": "password", "port": 5432, "query": "SELECT * FROM mytable WHERE updated_at > :last_updated_at", "batch_size": 1000, "chunk_size": 1000, "chunk_overlap": 0 } ``` **YAML** ```yaml source: source_data_type: "database" db_type: "postgres" host: "localhost" database_name: "mydb" username: "user" password: "password" port: 5432 query: "SELECT * FROM mytable WHERE updated_at > :last_updated_at" batch_size: 1000 chunk_size: 1000 chunk_overlap: 0 ``` ### Dropbox **Python variable (as JSON)** ```json { "source_data_type": "Dropbox", "key": "your-dropbox-key", "folder_path": "/path/to/folder/", "file_type": "csv", "chunk_size": 1000, "chunk_overlap": 0 } ``` **YAML** ```yaml source: source_data_type: "Dropbox" key: "your-dropbox-key" folder_path: "/path/to/folder/" file_type: "csv" chunk_size: 1000 chunk_overlap: 0 ``` ### Stripe **Python variable (as JSON)** ```json { "source_data_type": "stripe", "access_token": "your-stripe-access-token", "table": "charges", "chunk_size": 1000, "chunk_overlap": 0 } ``` **YAML** ```yaml source: source_data_type: "stripe" access_token: "your-stripe-access-token" table: "charges" chunk_size: 1000 chunk_overlap: 0 ``` ### Zendesk **Python variable (as JSON)** ```json { "source_data_type": "zendesk", "user_email": "your-email@example.com", "access_token": "your-zendesk-access-token", "subdomain": "your-subdomain", "table": "tickets", "chunk_size": 1000, "chunk_overlap": 0 } ``` **YAML** ```yaml source: source_data_type: "zendesk" user_email: "your-email@example.com" access_token: "your-zendesk-access-token" subdomain: "your-subdomain" table: "tickets" chunk_size: 1000 chunk_overlap: 0 ``` ### Google Drive **Python variable (as JSON)** ```json { "source_data_type": "Google Drive", "credentials_path": "/path/to/your/credentials.json", "folder_id": "your-folder-id", "file_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet", "chunk_size": 1000, "chunk_overlap": 0 } ``` **YAML** ```yaml source: source_data_type: "Google Drive" credentials_path: "/path/to/your/credentials.json" folder_id: "your-folder-id" file_type: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" chunk_size: 1000 chunk_overlap: 0 ``` ### Google Cloud Storage **Python variable (as JSON)** ```json { "source_data_type": "Google Cloud Storage", "credentials_path": "/path/to/your/credentials.json", "bucket_name": "myBucket", "prefix": "prefix/", "file_type": "csv", "chunk_size": 1000, "chunk_overlap": 0 } ``` **YAML** ```yaml source: source_data_type: "Google Cloud Storage" credentials_path: "/path/to/your/credentials.json" bucket_name: "myBucket" prefix: "prefix/" file_type: "csv" chunk_size: 1000 chunk_overlap: 0 ``` ### Using Unstructured to process files Starting from version 0.1.6.3, you can now add Unstructured as file processing API. Users can now utilize the [Unstructured's Serverless API](https://unstructured.io/api-key-hosted) to efficiently extract data from a multitude of file based sources. **This is limited to [PDF, DOCX, DOC, TXT] files** In order to use Unstructured, you will need three additional parameters 1. `use_unstructured`: (True/False) indicator telling the framework to use the Unstructured API 2. `unstructured_api_key`: Enter your Unstructured API Key 3. `unstructured_url`: Enter your API Url from your Unstructured dashboard ```python # Example using Local file from vector_etl import create_flow source = { "source_data_type": "Local File", "file_path": "/path/to/file.docx", "file_type": "docx", "use_unstructured": true, "unstructured_api_key": "my-unstructured-key", "unstructured_url": "https://my-domain.api.unstructuredapp.io" } # Example using Amazon S3 from vector_etl import create_flow source = { "source_data_type": "Amazon S3", "bucket_name": "myBucket", "prefix": "Dir/Subdir/", "file_type": "pdf", "aws_access_key_id": "your-access-key", "aws_secret_access_key": "your-secret-access-key", "use_unstructured": true, "unstructured_api_key": "my-unstructured-key", "unstructured_url": "https://my-domain.api.unstructuredapp.io" } # Example using Google Cloud Storage from vector_etl import create_flow source = { "source_data_type": "Google Cloud Storage", "credentials_path": "/path/to/your/credentials.json", "bucket_name": "myBucket", "prefix": "prefix/", "file_type": "csv", "use_unstructured": true, "unstructured_api_key": "my-unstructured-key", "unstructured_url": "https://my-domain.api.unstructuredapp.io" } ``` ### Adding Custom Data Sources To add a custom data source: 1. Create a new file in the `source_mods` directory. 2. Implement a new class that inherits from `BaseSource`. 3. Implement the required methods: `connect()` and `fetch_data()`. 4. Update the `get_source_class()` function in `source_mods/__init__.py` to include your new source. Example of a custom source class: ```python from .base import BaseSource class MyCustomSource(BaseSource): def __init__(self, config): self.config = config def connect(self): # Implement connection logic here pass def fetch_data(self): # Implement data fetching logic here pass ```