# Extended Data Loader Support in Python SDK

Add support for more data sources and formats.
## Current Situation

The Indexify Python SDK currently has limited support for data loading. The main data loader implementations can be found in the `./indexify/data_loaders/` directory. Specifically:

- `LocalDirectoryLoader`: Loads files from a local directory.
- `UrlLoader`: Loads data from URLs.

These are defined in `./indexify/data_loaders/__init__.py`:

```python
from .local_directory_loader import LocalDirectoryLoader
from .url_loader import UrlLoader
```
While these cover basic use cases, they don't address more complex data sources that are common in modern data processing and ML workflows.
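To make the existing behavior concrete, here is a minimal, self-contained sketch of what a directory loader conceptually does: enumerate files as metadata, then expose raw bytes on demand. The `FileMetadata` fields and constructor parameters below are illustrative stand-ins, not the SDK's actual classes.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class FileMetadata:
    # Illustrative stand-in for the SDK's metadata class.
    path: str
    file_size: int


class LocalDirectoryLoaderSketch:
    def __init__(self, directory: str, file_extensions: Optional[List[str]] = None):
        self.directory = Path(directory)
        # Extensions include the dot, e.g. [".txt"]; None means "all files".
        self.file_extensions = file_extensions

    def load(self) -> List[FileMetadata]:
        files = []
        for p in sorted(self.directory.rglob("*")):
            if not p.is_file():
                continue
            if self.file_extensions and p.suffix not in self.file_extensions:
                continue
            files.append(FileMetadata(path=str(p), file_size=p.stat().st_size))
        return files

    def read_all_bytes(self, file_metadata: FileMetadata) -> bytes:
        return Path(file_metadata.path).read_bytes()
```

The two-phase shape (list metadata first, fetch bytes later) is what the proposals below generalize to non-filesystem sources.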
## Problem
The current data loader support is insufficient for many real-world scenarios, particularly those involving:
- Database systems (SQL and NoSQL)
- Cloud storage services (e.g., AWS S3, Google Cloud Storage, Azure Blob Storage)
- Streaming data sources (e.g., Kafka, RabbitMQ)
- Specialized file formats common in data science (e.g., Parquet, Avro, HDF5)
This limitation restricts the types of data sources that can be easily integrated into Indexify workflows, potentially forcing users to implement custom loaders or preprocess data before using Indexify.
## Proposed Solution
Extend the data loader support to include a wider range of data sources and formats. This will involve:
- Creating new data loader classes for various data sources
- Implementing a plugin system for easy integration of custom data loaders
- Updating the client interface to work seamlessly with these new data sources
## Implementation Plan

- **Database Loaders**
  - Implement `SQLDatabaseLoader` for relational databases (using SQLAlchemy for broad compatibility)
  - Implement `MongoDBLoader` for MongoDB (as an example NoSQL loader)
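A database loader has to map rows onto the loader API's "list metadata, then read bytes" shape. One plausible mapping, sketched with stdlib `sqlite3` so it stays self-contained (the proposed `SQLDatabaseLoader` would use SQLAlchemy for broad backend support), treats each row as one document serialized to JSON bytes:

```python
import json
import sqlite3
from typing import Any, Dict, List


class SQLDatabaseLoaderSketch:
    """Hypothetical sketch: one query, one document per row."""

    def __init__(self, connection: sqlite3.Connection, query: str):
        self.connection = connection
        self.query = query

    def load(self) -> List[Dict[str, Any]]:
        # sqlite3.Row lets us recover column names for each row.
        self.connection.row_factory = sqlite3.Row
        cursor = self.connection.execute(self.query)
        return [dict(row) for row in cursor.fetchall()]

    def read_all_bytes(self, record: Dict[str, Any]) -> bytes:
        # Serialize one row to JSON bytes for ingestion.
        return json.dumps(record, sort_keys=True).encode("utf-8")
```

Whether a "document" should be a row, a batch of rows, or a whole result set is a design decision the real implementation would need to settle.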
- **Cloud Storage Loaders**
  - Implement `S3Loader` for Amazon S3
  - Implement `GCSLoader` for Google Cloud Storage
  - Implement `AzureBlobLoader` for Azure Blob Storage
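An `S3Loader` could look roughly like the sketch below. The client is injected rather than constructed with `boto3.client("s3")`, which keeps the example runnable without AWS credentials and makes the loader testable against a fake; everything besides the boto3-shaped `list_objects_v2`/`get_object` calls is an assumption of this sketch.

```python
from typing import Any, Dict, List


class S3LoaderSketch:
    """Hypothetical S3 loader over a boto3-compatible client."""

    def __init__(self, client: Any, bucket: str, prefix: str = ""):
        self.client = client
        self.bucket = bucket
        self.prefix = prefix

    def load(self) -> List[Dict[str, Any]]:
        # boto3's list_objects_v2 returns a page with a "Contents" list.
        # A production loader would also follow ContinuationToken pages.
        response = self.client.list_objects_v2(Bucket=self.bucket, Prefix=self.prefix)
        return [
            {"key": obj["Key"], "size": obj["Size"]}
            for obj in response.get("Contents", [])
        ]

    def read_all_bytes(self, metadata: Dict[str, Any]) -> bytes:
        response = self.client.get_object(Bucket=self.bucket, Key=metadata["key"])
        return response["Body"].read()
```

The same injected-client pattern would carry over to `GCSLoader` and `AzureBlobLoader` with their respective SDK clients.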
- **Streaming Data Loaders**
  - Implement `KafkaLoader` for Apache Kafka
  - Implement `RabbitMQLoader` for RabbitMQ
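Streaming sources do not fit a one-shot `load()` naturally, so a `KafkaLoader` would likely drain messages in bounded batches. In this hedged sketch the consumer is injected: anything iterable that yields objects with a `.value` bytes attribute (like kafka-python's `ConsumerRecord`) works, so the batching logic can be shown without a running broker.

```python
from typing import Any, Iterator, List


class KafkaLoaderSketch:
    """Hypothetical streaming loader: each load() drains up to a batch."""

    def __init__(self, consumer: Iterator[Any], max_messages: int = 100):
        # Must be an iterator (not a list) so successive load() calls
        # resume where the previous batch stopped.
        self.consumer = consumer
        self.max_messages = max_messages

    def load(self) -> List[bytes]:
        batch = []
        for message in self.consumer:
            batch.append(message.value)
            if len(batch) >= self.max_messages:
                break
        return batch
```

Offset commits, rebalancing, and backpressure are deliberately out of scope here; they are the hard part a real `KafkaLoader` would have to address.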
- **Specialized File Format Loaders**
  - Implement `ParquetLoader` for Parquet files
  - Implement `AvroLoader` for Avro files
  - Implement `HDF5Loader` for HDF5 files
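The format-specific loaders could share a registry keyed by file extension, so callers never select a loader class by hand. The sketch below shows only the dispatch mechanism; the parse functions are placeholders (a real `ParquetLoader` would use e.g. pyarrow, and `AvroLoader` e.g. fastavro).

```python
import os
from typing import Callable, Dict

# Hypothetical registry mapping extensions to parse functions.
LOADER_REGISTRY: Dict[str, Callable[[bytes], object]] = {}


def register_loader(extension: str):
    def decorator(func):
        LOADER_REGISTRY[extension] = func
        return func
    return decorator


@register_loader(".parquet")
def load_parquet(data: bytes) -> object:
    # Placeholder: a real implementation would call pyarrow.parquet.
    return ("parquet", len(data))


@register_loader(".avro")
def load_avro(data: bytes) -> object:
    # Placeholder: a real implementation would use fastavro.
    return ("avro", len(data))


def load_by_extension(filename: str, data: bytes) -> object:
    ext = os.path.splitext(filename)[1]
    if ext not in LOADER_REGISTRY:
        raise ValueError(f"no loader registered for {ext!r}")
    return LOADER_REGISTRY[ext](data)
```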
- **Plugin System**
  - Create a base `DataLoaderPlugin` class
  - Implement a mechanism to discover and load custom data loader plugins
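One possible discovery mechanism, sketched under the assumption that subclass registration is acceptable: the `DataLoaderPlugin` base class records every subclass via `__init_subclass__`, so third-party packages only need to import their plugin module to register it (setuptools entry points would be an alternative). Only the class name `DataLoaderPlugin` comes from the proposal; the registration scheme is this sketch's assumption.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type


class DataLoaderPlugin(ABC):
    # Shared registry of all discovered plugins, keyed by class name.
    registry: Dict[str, Type["DataLoaderPlugin"]] = {}

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        cls.registry[cls.__name__] = cls

    @abstractmethod
    def load(self) -> List[object]:
        ...


class ExampleCustomLoader(DataLoaderPlugin):
    """Defining a subclass is all it takes to be discoverable."""

    def load(self) -> List[object]:
        return ["example"]
```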
- **Client Interface Updates**
  - Extend the `IndexifyClient` class to support the new data loaders
  - Update the `ingest_from_loader` method to work with the new loaders
## Code Changes

- Create new files in `./indexify/data_loaders/` for each new loader, e.g., `sql_loader.py`, `s3_loader.py`, etc.
- Update `./indexify/data_loaders/__init__.py` to include the new loaders:
```python
from .local_directory_loader import LocalDirectoryLoader
from .url_loader import UrlLoader
from .sql_loader import SQLDatabaseLoader
from .mongodb_loader import MongoDBLoader
from .s3_loader import S3Loader
# ... (other imports)
```
- Implement the plugin system in a new file, e.g., `./indexify/data_loaders/plugin.py`:
```python
from abc import ABC, abstractmethod
from typing import List

# FileMetadata is the existing metadata class from the data_loaders package.


class DataLoaderPlugin(ABC):
    @abstractmethod
    def load(self) -> List[FileMetadata]:
        pass

    @abstractmethod
    def read_all_bytes(self, file_metadata: FileMetadata) -> bytes:
        pass
```
- Update the `IndexifyClient` class in `./indexify/client.py` to support the new loaders:
```python
class IndexifyClient:
    # ...
    def ingest_from_loader(self, loader: Union[DataLoader, DataLoaderPlugin], graph: str) -> List[str]:
        # Implementation to handle both built-in loaders and plugins
        ...
```
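Because built-in loaders and plugins both expose `load()`/`read_all_bytes()`, the dispatch in `ingest_from_loader` can stay trivial: the client only depends on the shared protocol. In this hedged sketch every name except `ingest_from_loader` is an illustrative stand-in, and `_upload` fakes the real HTTP upload so the control flow is runnable.

```python
from typing import Any, List


class IndexifyClientSketch:
    """Hypothetical client showing loader-agnostic ingestion."""

    def __init__(self):
        self.uploaded: List[bytes] = []

    def _upload(self, data: bytes, graph: str) -> str:
        # Stand-in for the real upload call; returns a fake content id.
        self.uploaded.append(data)
        return f"{graph}-{len(self.uploaded)}"

    def ingest_from_loader(self, loader: Any, graph: str) -> List[str]:
        # Works for anything with load()/read_all_bytes(): built-in
        # DataLoader subclasses and DataLoaderPlugin instances alike.
        content_ids = []
        for metadata in loader.load():
            data = loader.read_all_bytes(metadata)
            content_ids.append(self._upload(data, graph))
        return content_ids
```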