
Should PyDataFrame.collect() return a Table?

Open wjones127 opened this issue 3 years ago • 3 comments

Right now it returns List[pa.RecordBatch], but it might be more natural to return a pa.Table. For one thing, they have a better repr provided by PyArrow.

wjones127 avatar Feb 20 '22 03:02 wjones127

Aside from the repr, do you see any other advantages?

matthewmturner avatar Feb 20 '22 05:02 matthewmturner

This is to keep the signature in sync with what we have in the Rust core. Perhaps it would be better to add a new method to return a pa.Table instead.

houqp avatar Feb 20 '22 07:02 houqp

Aside from the repr, do you see any other advantages?

Mostly was just surprised coming from PyArrow, but it sounds like Rust usually just represents results as a sequence of record batches.

Perhaps it would be better to add a new method to return a pa.Table instead.

Yeah, perhaps that's a better path. A to_table() method is common in PyArrow. If we eventually get the Arrow C stream interface implemented in arrow-rs, we could also provide a to_reader().

wjones127 avatar Feb 20 '22 16:02 wjones127