
extensibility of data sources

Open · houqp opened this issue May 03 '20 · 5 comments

I am experimenting with evaluating a lazy frame against a custom data source. However, it looks like Reader being declared as a struct makes it hard to add support for custom data sources that shouldn't be part of the dataframe core code base.

Would it make sense to change Reader and Writer into traits so that custom data source implementations can be fully decoupled from the core code base?
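
For example, something along these lines (a rough sketch; the trait and method names here are just placeholders, not the crate's actual API):

use arrow::datatypes::SchemaRef;
use arrow::record_batch::RecordBatch;

/// Hypothetical trait replacing the concrete Reader struct.
pub trait DataSourceReader {
    /// Schema of the data this source produces.
    fn schema(&self) -> SchemaRef;
    /// Pull the next batch of records, or None when exhausted.
    fn next_batch(&mut self) -> arrow::error::Result<Option<RecordBatch>>;
}

/// Hypothetical counterpart for writers.
pub trait DataSourceWriter {
    /// Write a single batch to the sink.
    fn write_batch(&mut self, batch: &RecordBatch) -> arrow::error::Result<()>;
}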

houqp avatar May 03 '20 23:05 houqp

For lazy reads, what matters is being able to get the schema (easy) and being able to express predicate push-down if supported. The DataSourceEval does the first, but I hadn't gotten to the second.

It's fine if we start with just a trait that doesn't express predicate push-down, as long as the solution leaves scope for it being done in future.

Given that DataFrame takes in a &Reader and creates a concrete Arrow reader for the various formats, we could use arrow::record_batch::RecordBatchReader as the trait for custom sources. We ultimately want to consume a bunch of record batches to create the dataframe.
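
To make that concrete, the dataframe side would only need to drain the reader. A sketch, assuming the arrow version where RecordBatchReader exposes schema() and next_batch() (later releases model it as an Iterator of Result<RecordBatch> instead):

use arrow::error::Result;
use arrow::record_batch::{RecordBatch, RecordBatchReader};

// Drain any RecordBatchReader into the batches a DataFrame is built from.
fn collect_batches<R: RecordBatchReader>(reader: &mut R) -> Result<Vec<RecordBatch>> {
    let mut batches = Vec::new();
    while let Some(batch) = reader.next_batch()? {
        batches.push(batch);
    }
    Ok(batches)
}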

I'm typing this on my phone (power outage), but I'll update my comment later when I'm on a desktop.

nevi-me avatar May 04 '20 09:05 nevi-me

if we want to consider predicate push-down as part of the initial design, then perhaps it should be a higher level trait that provides access to both RecordBatchReader as well as the predicates?

houqp avatar May 05 '20 00:05 houqp

> if we want to consider predicate push-down as part of the initial design, then perhaps it should be a higher level trait that provides access to both RecordBatchReader as well as the predicates?

Yes, I'm thinking along the lines of creating a trait that allows data sources to declare their capabilities (projection, filtering, etc.), then applying optimisations to them based on those capabilities.

nevi-me avatar Jun 06 '20 07:06 nevi-me

Hey @houqp, I spent some time trying to find a solution to this. The current #[derive(Serialize, Deserialize, Debug, Clone)] on crate::expression::Reader prevents me from passing in a trait object that implements a reader/writer (I also tried boxing it).
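
In simplified form, the conflict is roughly this (a sketch; the real Reader has more fields):

use serde::{Deserialize, Serialize};

pub trait DataSource { /* ... */ }

// serde's derives require every field to implement Serialize/Deserialize
// (and Clone requires Clone), but a boxed trait object implements none of
// them, so this fails to compile:
#[derive(Serialize, Deserialize, Debug, Clone)]
pub struct Reader {
    source: Box<dyn DataSource>,
}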

I haven't explored the idea of a datasource registry, but maybe that could work. I'm thinking of instead using Arrow Flight for custom datasources. The main downside in the short term is that a custom source would need to implement the gRPC protocol + some defined extensions that would allow determining capabilities.

Spark's DataSource API benefits from runtime reflection to get the correct classes; I'm not good enough with Rust's type system to find an equivalent solution that would work :(


I wanted to implement something like:

use arrow::datatypes::SchemaRef;
use arrow::record_batch::RecordBatch;

pub trait DataSource: INSERT_BOUNDS {
    /// Materialise the source as a dataset.
    fn get_dataset(&self) -> Result<Dataset>;
    /// Where the data comes from (file, database, etc.).
    fn source(&self) -> DataSourceType;
    /// Format of the underlying data, e.g. "csv" or "parquet".
    fn format(&self) -> &str;
    /// Schema of the record batches this source yields.
    fn schema(&self) -> SchemaRef;
    /// Pull the next batch, or None when the source is exhausted.
    fn next_batch(&mut self) -> Result<Option<RecordBatch>>;

    // Capability flags: sources opt in to push-down by overriding these.
    fn supports_projection(&self) -> bool {
        false
    }
    fn supports_filtering(&self) -> bool {
        false
    }
    fn supports_sorting(&self) -> bool {
        false
    }
    fn supports_limit(&self) -> bool {
        false
    }

    // Push-down hooks, only called when the matching capability is declared.
    fn limit(&mut self, limit: usize) -> Result<()>;
    fn filter(&mut self, filter: BooleanFilter) -> Result<()>;
    fn project(&mut self, columns: Vec<String>) -> Result<()>;
    fn sort(&mut self, criteria: Vec<SortCriteria>) -> Result<()>;
}
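
For example, a hypothetical in-memory source could implement it like this (Dataset, DataSourceType, BooleanFilter and SortCriteria stand in for whatever the crate ends up defining):

struct MemorySource {
    schema: arrow::datatypes::SchemaRef,
    batches: Vec<RecordBatch>,
    cursor: usize,
}

impl DataSource for MemorySource {
    fn get_dataset(&self) -> Result<Dataset> {
        unimplemented!()
    }
    fn source(&self) -> DataSourceType {
        DataSourceType::Memory // hypothetical variant
    }
    fn format(&self) -> &str {
        "memory"
    }
    fn schema(&self) -> arrow::datatypes::SchemaRef {
        self.schema.clone()
    }
    fn next_batch(&mut self) -> Result<Option<RecordBatch>> {
        let batch = self.batches.get(self.cursor).cloned();
        self.cursor += 1;
        Ok(batch)
    }

    // The capability defaults stay false, so the push-down hooks should
    // never be called and can simply reject.
    fn limit(&mut self, _limit: usize) -> Result<()> {
        unimplemented!()
    }
    fn filter(&mut self, _filter: BooleanFilter) -> Result<()> {
        unimplemented!()
    }
    fn project(&mut self, _columns: Vec<String>) -> Result<()> {
        unimplemented!()
    }
    fn sort(&mut self, _criteria: Vec<SortCriteria>) -> Result<()> {
        unimplemented!()
    }
}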

nevi-me avatar Jul 27 '20 10:07 nevi-me

I think Arrow Flight would be overkill for this use case. As for the proposed trait, wouldn't it make more sense to move the responsibility of performing operations like filter and limit into the dataframe struct itself?
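
i.e. roughly like this (a sketch; filter_in_memory is a hypothetical helper on the dataframe, not an existing method):

// The dataframe decides whether to push a predicate down into the source
// or evaluate it itself over the batches the source returns.
fn apply_filter(
    df: &mut DataFrame,
    source: &mut dyn DataSource,
    predicate: BooleanFilter,
) -> Result<()> {
    if source.supports_filtering() {
        // the source can evaluate the predicate itself, e.g. in SQL
        source.filter(predicate)
    } else {
        // fall back to the dataframe's own compute kernels
        df.filter_in_memory(predicate)
    }
}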

houqp avatar Aug 02 '20 21:08 houqp