Extended Data Loader Support in Python SDK

Open PulkitMishra opened this issue 4 months ago • 0 comments

Extended Data Loader Support: Add support for more data sources and formats

Current Situation

The Indexify Python SDK currently has limited support for data loading. The main data loader implementations can be found in the ./indexify/data_loaders/ directory. Specifically:

  1. LocalDirectoryLoader: Loads files from a local directory.
  2. UrlLoader: Loads data from URLs.

These are exported from ./indexify/data_loaders/__init__.py:

from .local_directory_loader import LocalDirectoryLoader
from .url_loader import UrlLoader

While these cover basic use cases, they don't address more complex data sources that are common in modern data processing and ML workflows.

Problem

The current data loader support is insufficient for many real-world scenarios, particularly those involving:

  1. Database systems (SQL and NoSQL)
  2. Cloud storage services (e.g., AWS S3, Google Cloud Storage, Azure Blob Storage)
  3. Streaming data sources (e.g., Kafka, RabbitMQ)
  4. Specialized file formats common in data science (e.g., Parquet, Avro, HDF5)

This limitation restricts the types of data sources that can be easily integrated into Indexify workflows, potentially forcing users to implement custom loaders or preprocess data before using Indexify.

Proposed Solution

Extend the data loader support to include a wider range of data sources and formats. This will involve:

  1. Creating new data loader classes for various data sources
  2. Implementing a plugin system for easy integration of custom data loaders
  3. Updating the client interface to work seamlessly with these new data sources

Implementation Plan

  1. Database Loaders

    • Implement SQLDatabaseLoader for relational databases (using SQLAlchemy for broad compatibility)
    • Implement MongoDBLoader for MongoDB (as an example NoSQL loader)
  2. Cloud Storage Loaders

    • Implement S3Loader for Amazon S3
    • Implement GCSLoader for Google Cloud Storage
    • Implement AzureBlobLoader for Azure Blob Storage
  3. Streaming Data Loaders

    • Implement KafkaLoader for Apache Kafka
    • Implement RabbitMQLoader for RabbitMQ
  4. Specialized File Format Loaders

    • Implement ParquetLoader for Parquet files
    • Implement AvroLoader for Avro files
    • Implement HDF5Loader for HDF5 files
  5. Plugin System

    • Create a base DataLoaderPlugin class
    • Implement a mechanism to discover and load custom data loader plugins
  6. Client Interface Updates

    • Extend the IndexifyClient class to support the new data loaders
    • Update the ingest_from_loader method to work with the new loaders
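
To make the loader items in the plan concrete, here is a minimal sketch of what a database-backed loader could look like. It is only an illustration under several assumptions: FileMetadata and DataLoader below are local stand-ins (their fields are assumed, not taken from the SDK), SQLiteTableLoader is a hypothetical name, and the standard-library sqlite3 module is used in place of SQLAlchemy so the example stays self-contained. Each table row is treated as one "file".

```python
import hashlib
import sqlite3
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class FileMetadata:
    # Stand-in for the SDK's FileMetadata; these field names are assumptions.
    path: str
    file_size: int
    mime_type: str
    md5_hash: str
    created_at: int
    updated_at: int


class DataLoader(ABC):
    # Stand-in for the SDK's loader base class.
    @abstractmethod
    def load(self) -> List[FileMetadata]: ...

    @abstractmethod
    def read_all_bytes(self, file_metadata: FileMetadata) -> bytes: ...


class SQLiteTableLoader(DataLoader):
    """Hypothetical loader that exposes each row of a table as one item."""

    def __init__(self, conn: sqlite3.Connection, table: str, id_col: str, body_col: str):
        self.conn = conn
        self.table, self.id_col, self.body_col = table, id_col, body_col

    @staticmethod
    def _encode(value) -> bytes:
        return value if isinstance(value, bytes) else str(value).encode("utf-8")

    def load(self) -> List[FileMetadata]:
        now = int(time.time())
        # Identifiers cannot be bound as parameters; they are assumed to come
        # from trusted configuration, never from user input.
        rows = self.conn.execute(
            f'SELECT "{self.id_col}", "{self.body_col}" FROM "{self.table}"'
        ).fetchall()
        items = []
        for row_id, body in rows:
            data = self._encode(body)
            items.append(FileMetadata(
                path=f"{self.table}/{row_id}",
                file_size=len(data),
                mime_type="text/plain",
                md5_hash=hashlib.md5(data).hexdigest(),
                created_at=now,
                updated_at=now,
            ))
        return items

    def read_all_bytes(self, file_metadata: FileMetadata) -> bytes:
        # The row id is recovered from the "<table>/<id>" path built in load().
        row_id = file_metadata.path.split("/", 1)[1]
        row = self.conn.execute(
            f'SELECT "{self.body_col}" FROM "{self.table}" WHERE "{self.id_col}" = ?',
            (row_id,),
        ).fetchone()
        if row is None:
            raise KeyError(file_metadata.path)
        return self._encode(row[0])
```

A SQLAlchemy-based SQLDatabaseLoader would presumably follow the same two-method shape, with the connection and query construction swapped out.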

Code Changes

  1. Create new files in ./indexify/data_loaders/ for each new loader, e.g., sql_loader.py, s3_loader.py, etc.

  2. Update ./indexify/data_loaders/__init__.py to include the new loaders:

from .local_directory_loader import LocalDirectoryLoader
from .url_loader import UrlLoader
from .sql_loader import SQLDatabaseLoader
from .mongodb_loader import MongoDBLoader
from .s3_loader import S3Loader
# ... (other imports)
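
  3. Implement the plugin system in a new file, e.g., ./indexify/data_loaders/plugin.py:

from abc import ABC, abstractmethod
from typing import List

from . import FileMetadata  # assumes FileMetadata is defined in this package

class DataLoaderPlugin(ABC):
    @abstractmethod
    def load(self) -> List[FileMetadata]:
        """Return metadata for every item the loader exposes."""

    @abstractmethod
    def read_all_bytes(self, file_metadata: FileMetadata) -> bytes:
        """Return the raw bytes of the item described by file_metadata."""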
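
The discovery mechanism mentioned in the plan is not specified in this issue. One possible approach, sketched below, is a simple in-process registry where plugin subclasses register themselves under a name. Everything here is an assumption for illustration: the `name` attribute, the `_REGISTRY` dict, and the `available_plugins` and `create_plugin` helpers are hypothetical, and items are plain strings rather than FileMetadata to keep the example short. Packaging entry points read through importlib.metadata would be another option.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Type

# Hypothetical registry mapping a plugin name to its class.
_REGISTRY: Dict[str, Type["DataLoaderPlugin"]] = {}


class DataLoaderPlugin(ABC):
    # Subclasses set `name` to register themselves (assumed convention).
    name: str = ""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Only register classes that declare their own name, so subclasses
        # that merely inherit a name do not collide with their parent.
        own_name = cls.__dict__.get("name")
        if own_name:
            if own_name in _REGISTRY:
                raise ValueError(f"duplicate data loader plugin: {own_name}")
            _REGISTRY[own_name] = cls

    @abstractmethod
    def load(self) -> List[str]: ...

    @abstractmethod
    def read_all_bytes(self, file_metadata: str) -> bytes: ...


def available_plugins() -> List[str]:
    """Return the names of all registered plugins, sorted."""
    return sorted(_REGISTRY)


def create_plugin(name: str, **kwargs) -> DataLoaderPlugin:
    """Instantiate a registered plugin by name."""
    try:
        cls = _REGISTRY[name]
    except KeyError:
        raise KeyError(f"unknown data loader plugin: {name!r}") from None
    return cls(**kwargs)


class InMemoryLoader(DataLoaderPlugin):
    """Toy plugin used only to show registration."""

    name = "memory"

    def __init__(self, blobs: Dict[str, bytes]):
        self.blobs = blobs

    def load(self) -> List[str]:
        return list(self.blobs)

    def read_all_bytes(self, file_metadata: str) -> bytes:
        return self.blobs[file_metadata]
```

With this shape, defining the subclass is enough to make it discoverable; a caller could then write `create_plugin("memory", blobs={"a.txt": b"hi"})` without importing the class directly.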

  4. Update the IndexifyClient class in ./indexify/client.py to support the new loaders:

from typing import List, Union

class IndexifyClient:
    # ...
    def ingest_from_loader(self, loader: Union[DataLoader, DataLoaderPlugin], graph: str) -> List[str]:
        # Implementation to handle both built-in loaders and plugins
        ...
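
As a rough idea of how the ingestion loop might treat built-in loaders and plugins uniformly, the sketch below relies on structural typing: anything with `load` and `read_all_bytes` is accepted. It is written as a standalone function rather than a client method, and the `upload` callable is a hypothetical stand-in for whatever the client actually does to send bytes to a graph; neither is taken from the SDK.

```python
from typing import Callable, List, Protocol


class SupportsLoading(Protocol):
    """Structural type: any object with these two methods qualifies."""

    def load(self) -> list: ...

    def read_all_bytes(self, file_metadata) -> bytes: ...


def ingest_from_loader(
    loader: SupportsLoading,
    graph: str,
    upload: Callable[[str, bytes], str],
) -> List[str]:
    """Read every item from the loader and hand it to `upload`.

    `upload` is a hypothetical hook that takes the graph name and the raw
    bytes and returns an identifier for the ingested object.
    """
    ids = []
    for meta in loader.load():
        data = loader.read_all_bytes(meta)
        ids.append(upload(graph, data))
    return ids
```

Because the function only depends on the two-method interface, the Union in the signature above could in principle be replaced by a single protocol type, which avoids special-casing plugins.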

PulkitMishra avatar Oct 01 '24 06:10 PulkitMishra