
Add Arrow Support

Open sunchao opened this issue 6 years ago • 8 comments

This is the umbrella ticket to track adding Apache Arrow support. Tasks:

  • [ ] Add Arrow schema converter for read path (#185).
  • [ ] Add Arrow schema converter for write path.
  • [ ] Support reading & writing Arrow in encoders/decoders (#191).
  • [ ] Support record reader for Arrow.
  • [ ] Support record writer for Arrow.
  • [ ] Update documentation to cover the new feature & how to use it.

sunchao avatar Nov 06 '18 07:11 sunchao

I think the next tasks will be:

  • Add reader that reads parquet into arrow.
  • Complete the converter to convert arrow schema to parquet schema.
  • Add writer to save arrow data to parquet format.

liurenjie1024 avatar Nov 06 '18 08:11 liurenjie1024

Thanks @liurenjie1024. Updated the description with some potential tasks.

sunchao avatar Nov 07 '18 08:11 sunchao

I suggest adding an item to update the existing doc to reflect the addition of arrow reader/writer.

sadikovi avatar Nov 08 '18 12:11 sadikovi

DataFusion has code for loading parquet into arrow ... might be worth looking at

andygrove avatar Nov 08 '18 14:11 andygrove

@sadikovi Thanks - added. @andygrove cool - will take a look.

sunchao avatar Nov 08 '18 16:11 sunchao

@andygrove Yes, I'll take that as a reference. I'll also look at the C++ implementation of the Arrow adapter for Parquet.

liurenjie1024 avatar Nov 09 '18 02:11 liurenjie1024

I am very interested in this. I am wondering if we can add a generic reader trait to the main arrow project and then have an implementation in parquet-rs.

I have a CSV reader for arrow that could be published as a separate crate and implement the same trait.

andygrove avatar Nov 10 '18 14:11 andygrove
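To make the generic reader trait idea concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `BatchReader`, `next_batch`, `CsvReader`, and `count_rows` are hypothetical names, and `RecordBatch` is a one-column stand-in rather than the real arrow type, so the snippet compiles without any external crate. A Parquet-backed reader in parquet-rs would presumably implement the same trait.

```rust
use std::sync::Arc;

/// Stand-in for arrow's `RecordBatch`, reduced to one integer column
/// so the sketch compiles without the arrow crate.
#[derive(Debug, PartialEq)]
pub struct RecordBatch {
    pub column: Vec<i64>,
}

/// Hypothetical trait that could live in the main arrow project.
pub trait BatchReader {
    /// Returns the next batch, `Ok(None)` at end of input, or an error message.
    fn next_batch(&mut self) -> Result<Option<Arc<RecordBatch>>, String>;
}

/// Hypothetical CSV-backed implementation: one integer per line.
pub struct CsvReader<'a> {
    lines: std::str::Lines<'a>,
    batch_size: usize,
}

impl<'a> CsvReader<'a> {
    pub fn new(data: &'a str, batch_size: usize) -> Self {
        CsvReader { lines: data.lines(), batch_size }
    }
}

impl<'a> BatchReader for CsvReader<'a> {
    fn next_batch(&mut self) -> Result<Option<Arc<RecordBatch>>, String> {
        let mut column = Vec::with_capacity(self.batch_size);
        // Fill the batch until it is full or the input runs out.
        while column.len() < self.batch_size {
            match self.lines.next() {
                Some(line) => {
                    let value = line
                        .trim()
                        .parse::<i64>()
                        .map_err(|e| format!("bad value {:?}: {}", line, e))?;
                    column.push(value);
                }
                None => break,
            }
        }
        if column.is_empty() {
            Ok(None)
        } else {
            Ok(Some(Arc::new(RecordBatch { column })))
        }
    }
}

/// Generic consumer: works with any `BatchReader`, whatever the file format.
pub fn count_rows<R: BatchReader>(reader: &mut R) -> Result<usize, String> {
    let mut rows = 0;
    while let Some(batch) = reader.next_batch()? {
        rows += batch.column.len();
    }
    Ok(rows)
}
```

With this shape, code such as `count_rows` is written once against the trait, and each format crate only supplies its own `next_batch`.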

Actually, maybe this is as simple as implementing Iterator<Item = Arc<RecordBatch>>

andygrove avatar Nov 10 '18 16:11 andygrove
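As a rough illustration of the iterator suggestion: in Rust the bound is spelled with an associated type, `Iterator<Item = Arc<RecordBatch>>`, so no new trait is needed. The sketch below is an assumption-laden example, not parquet-rs or arrow code: `ChunkedReader` and `total_rows` are hypothetical, and `RecordBatch` is again a one-column stand-in for the real arrow type. A real reader can fail mid-file, so an item type wrapping a `Result` may turn out to be the better fit.

```rust
use std::sync::Arc;

/// Stand-in for arrow's `RecordBatch` (one integer column only).
#[derive(Debug, PartialEq)]
pub struct RecordBatch {
    pub column: Vec<i64>,
}

/// Hypothetical reader that slices an in-memory column into fixed-size batches.
pub struct ChunkedReader {
    values: Vec<i64>,
    offset: usize,
    batch_size: usize,
}

impl ChunkedReader {
    pub fn new(values: Vec<i64>, batch_size: usize) -> Self {
        assert!(batch_size > 0, "batch_size must be positive");
        ChunkedReader { values, offset: 0, batch_size }
    }
}

impl Iterator for ChunkedReader {
    type Item = Arc<RecordBatch>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.offset >= self.values.len() {
            return None; // all values have been handed out
        }
        // Take up to `batch_size` values starting at the current offset.
        let end = (self.offset + self.batch_size).min(self.values.len());
        let column = self.values[self.offset..end].to_vec();
        self.offset = end;
        Some(Arc::new(RecordBatch { column }))
    }
}

/// Consumer that accepts any source of batches, regardless of file format.
pub fn total_rows<I>(batches: I) -> usize
where
    I: Iterator<Item = Arc<RecordBatch>>,
{
    batches.map(|batch| batch.column.len()).sum()
}
```

Because the reader is an ordinary iterator, callers get `map`, `filter`, `take`, and `for` loops for free.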