Expose a python method to use RecordBatchReader instead of PyArrow Dataset
Description
It's impossible to use a PyArrow Dataset to represent Column Mapping (https://github.com/apache/arrow/issues/36593), and Deletion Vectors cannot be represented in a Dataset either. PyArrow Tables are more flexible, but they are fully loaded into RAM. I think the right abstraction would be a `to_recordbatchreader()` method on Delta tables that takes a (partition) filter parameter.
Use Case
Future support for deletion vectors and column mapping
@aersam I don't see in the PyArrow docs how we can read record batches from storage with a RecordBatchReader. I only see a path via `dataset.to_batches()` and then building the reader from those batches.
Well, I would also recommend implementing it in Rust: https://arrow.apache.org/rust/arrow/record_batch/trait.RecordBatchReader.html
But the thing is that a RecordBatchReader can be constructed from anything, in either Rust or PyArrow. It's a very generic abstraction: just a schema and an iterator over RecordBatches.
Hi @aersam. You are right that PyArrow datasets right now will be a dead end as we move to support deletion vectors, column mapping, and other new features. I've been meaning to define a new protocol that will allow us to expose something like a PyArrow Dataset, but that we can create a custom implementation of in Rust. This is tracked in https://github.com/apache/arrow/issues/37504
In the near term though, it does seem like it might be appropriate to expose a method like:
```python
def scan(
    self,
    columns: Optional[List[str]] = None,
    filter: Optional[???] = None,
) -> pa.RecordBatchReader:
    ...
```
And implement that with a Rust-based scanner that supports newer table features.
@wjones127 You mean this: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.scanner ?
@ion-elgreco That is implemented in C++ and something we don't have control over. But yes, it would have many similarities to that.