## 5. Data Sources

VectorETL supports a wide range of data sources, allowing you to extract data from various systems and formats.

### Overview of Supported Sources

- Amazon S3
- Box
- Local files
- Databases (PostgreSQL, MySQL, Snowflake, Salesforce)
- Dropbox
- Stripe
- Zendesk
- Google Drive
- Google Cloud Storage

### Amazon S3

**Python variable (as JSON)**
```json
{
    "source_data_type": "Amazon S3",
    "bucket_name": "myBucket",
    "prefix": "Dir/Subdir/",
    "file_type": "csv",
    "aws_access_key_id": "your-access-key",
    "aws_secret_access_key": "your-secret-access-key",
    "chunk_size": 1000,
    "chunk_overlap": 0
}
```

**YAML**
```yaml
source:
  source_data_type: "Amazon S3"
  bucket_name: "myBucket"
  prefix: "Dir/Subdir/"
  file_type: "csv"
  aws_access_key_id: "your-access-key"
  aws_secret_access_key: "your-secret-access-key"
  chunk_size: 1000
  chunk_overlap: 0
```

### Box

**Python variable (as JSON)**
```json
{
    "source_data_type": "Box",
    "folder_path": "MyFolder",
    "file_type": "pdf",
    "access_token": "your-developer-token",
    "chunk_size": 1000,
    "chunk_overlap": 200
}
```

**YAML**
```yaml
source:
  source_data_type: "Box"
  folder_path: "MyFolder"
  file_type: "pdf"
  access_token: "your-developer-token"
  chunk_size: 1000
  chunk_overlap: 200
```

### Local Files

**Python variable (as JSON)**
```json
{
    "source_data_type": "Local File",
    "file_path": "/path/to/your/data/",
    "file_type": "csv",
    "chunk_size": 1000,
    "chunk_overlap": 0
}
```

**YAML**
```yaml
source:
  source_data_type: "Local File"
  file_path: "/path/to/your/data/"
  file_type: "csv"
  chunk_size: 1000
  chunk_overlap: 0
```

### Database (PostgreSQL example)

**Python variable (as JSON)**
```json
{
    "source_data_type": "database",
    "db_type": "postgres",
    "host": "localhost",
    "database_name": "mydb",
    "username": "user",
    "password": "password",
    "port": 5432,
    "query": "SELECT * FROM mytable WHERE updated_at > :last_updated_at",
    "batch_size": 1000,
    "chunk_size": 1000,
    "chunk_overlap": 0
}
```

**YAML**
```yaml
source:
  source_data_type: "database"
  db_type: "postgres"
  host: "localhost"
  database_name: "mydb"
  username: "user"
  password: "password"
  port: 5432
  query: "SELECT * FROM mytable WHERE updated_at > :last_updated_at"
  batch_size: 1000
  chunk_size: 1000
  chunk_overlap: 0
```

### Dropbox

**Python variable (as JSON)**
```json
{
    "source_data_type": "Dropbox",
    "key": "your-dropbox-key",
    "folder_path": "/path/to/folder/",
    "file_type": "csv",
    "chunk_size": 1000,
    "chunk_overlap": 0
}
```

**YAML**
```yaml
source:
  source_data_type: "Dropbox"
  key: "your-dropbox-key"
  folder_path: "/path/to/folder/"
  file_type: "csv"
  chunk_size: 1000
  chunk_overlap: 0
```

### Stripe

**Python variable (as JSON)**
```json
{
    "source_data_type": "stripe",
    "access_token": "your-stripe-access-token",
    "table": "charges",
    "chunk_size": 1000,
    "chunk_overlap": 0
}
```

**YAML**
```yaml
source:
  source_data_type: "stripe"
  access_token: "your-stripe-access-token"
  table: "charges"
  chunk_size: 1000
  chunk_overlap: 0
```

### Zendesk

**Python variable (as JSON)**
```json
{
    "source_data_type": "zendesk",
    "user_email": "your-email@example.com",
    "access_token": "your-zendesk-access-token",
    "subdomain": "your-subdomain",
    "table": "tickets",
    "chunk_size": 1000,
    "chunk_overlap": 0
}
```

**YAML**
```yaml
source:
  source_data_type: "zendesk"
  user_email: "your-email@example.com"
  access_token: "your-zendesk-access-token"
  subdomain: "your-subdomain"
  table: "tickets"
  chunk_size: 1000
  chunk_overlap: 0
```

### Google Drive

**Python variable (as JSON)**
```json
{
    "source_data_type": "Google Drive",
    "credentials_path": "/path/to/your/credentials.json",
    "folder_id": "your-folder-id",
    "file_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "chunk_size": 1000,
    "chunk_overlap": 0
}
```

**YAML**
```yaml
source:
  source_data_type: "Google Drive"
  credentials_path: "/path/to/your/credentials.json"
  folder_id: "your-folder-id"
  file_type: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
  chunk_size: 1000
  chunk_overlap: 0
```

### Google Cloud Storage

**Python variable (as JSON)**
```json
{
    "source_data_type": "Google Cloud Storage",
    "credentials_path": "/path/to/your/credentials.json",
    "bucket_name": "myBucket",
    "prefix": "prefix/",
    "file_type": "csv",
    "chunk_size": 1000,
    "chunk_overlap": 0
}
```

**YAML**
```yaml
source:
  source_data_type: "Google Cloud Storage"
  credentials_path: "/path/to/your/credentials.json"
  bucket_name: "myBucket"
  prefix: "prefix/"
  file_type: "csv"
  chunk_size: 1000
  chunk_overlap: 0
```

### Using Unstructured to process files

Starting from version 0.1.6.3, you can now add Unstructured as file processing API. Users can now utilize the [Unstructured's Serverless API](https://unstructured.io/api-key-hosted) to efficiently extract data from a multitude of file based sources.

**This is limited to [PDF, DOCX, DOC, TXT] files**

In order to use Unstructured, you will need three additional parameters

1. `use_unstructured`: (True/False) indicator telling the framework to use the Unstructured API
2. `unstructured_api_key`: Enter your Unstructured API Key
3. `unstructured_url`: Enter your API Url from your Unstructured dashboard

```python
# Example using Local file
from vector_etl import create_flow

source = {
    "source_data_type": "Local File",
    "file_path": "/path/to/file.docx",
    "file_type": "docx",
    "use_unstructured": true,
    "unstructured_api_key": "my-unstructured-key",
    "unstructured_url": "https://my-domain.api.unstructuredapp.io"
}

# Example using Amazon S3
from vector_etl import create_flow

source = {
    "source_data_type": "Amazon S3",
    "bucket_name": "myBucket",
    "prefix": "Dir/Subdir/",
    "file_type": "pdf",
    "aws_access_key_id": "your-access-key",
    "aws_secret_access_key": "your-secret-access-key",
    "use_unstructured": true,
    "unstructured_api_key": "my-unstructured-key",
    "unstructured_url": "https://my-domain.api.unstructuredapp.io"
}

# Example using Google Cloud Storage
from vector_etl import create_flow

source = {
    "source_data_type": "Google Cloud Storage",
    "credentials_path": "/path/to/your/credentials.json",
    "bucket_name": "myBucket",
    "prefix": "prefix/",
    "file_type": "csv",
    "use_unstructured": true,
    "unstructured_api_key": "my-unstructured-key",
    "unstructured_url": "https://my-domain.api.unstructuredapp.io"
}
```


### Adding Custom Data Sources

To add a custom data source:

1. Create a new file in the `source_mods` directory.
2. Implement a new class that inherits from `BaseSource`.
3. Implement the required methods: `connect()` and `fetch_data()`.
4. Update the `get_source_class()` function in `source_mods/__init__.py` to include your new source.

Example of a custom source class:

```python
from .base import BaseSource

class MyCustomSource(BaseSource):
    def __init__(self, config):
        self.config = config

    def connect(self):
        # Implement connection logic here
        pass

    def fetch_data(self):
        # Implement data fetching logic here
        pass
```