Expose a python method to use RecordBatchReader instead of PyArrow Dataset

Open aersam opened this issue 2 years ago • 5 comments

Description

A PyArrow Dataset cannot represent Column Mapping (https://github.com/apache/arrow/issues/36593), and Deletion Vectors cannot be represented in a Dataset either. PyArrow Tables are more flexible, but they are fully loaded into RAM. I think the correct abstraction would be a to_recordbatchreader() method on Delta tables that takes a (partition) filter parameter.

Use Case

Future support for deletion vectors and column mapping

aersam avatar Nov 07 '23 08:11 aersam

@aersam I don't see in the PyArrow docs how we can read record batches from storage with a RecordBatchReader. I only see a path that goes through dataset.to_batches() and then builds the reader from those batches.

ion-elgreco avatar Nov 08 '23 18:11 ion-elgreco

Well I would also recommend implementing it in Rust: https://arrow.apache.org/rust/arrow/record_batch/trait.RecordBatchReader.html

But the thing is that a RecordBatchReader can be constructed from anything, in either Rust or PyArrow. It's a very generic abstraction: just a Schema and an iterator over RecordBatches.

aersam avatar Nov 08 '23 19:11 aersam

Hi @aersam. You are right that PyArrow datasets will be a dead end as we move to support deletion vectors, column mapping, and other new features. I've been meaning to define a new protocol that will allow us to expose something like a PyArrow Dataset, but with a custom implementation that we can write in Rust. This is tracked in https://github.com/apache/arrow/issues/37504

In the near term, though, it does seem like it might be appropriate to expose a method like:

def scan(
    self,
    columns: Optional[List[str]] = None,
    filter: Optional[???] = None,
) -> pa.RecordBatchReader:
    ...

And implement that with a Rust-based scanner that supports newer table features.

wjones127 avatar Nov 08 '23 21:11 wjones127

@wjones127 You mean this: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.scanner ?

ion-elgreco avatar Nov 11 '23 12:11 ion-elgreco

@ion-elgreco That is implemented in C++ and something we don't have control over. But yes, it would have many similarities to that.

wjones127 avatar Nov 11 '23 20:11 wjones127