
WIP: parquet2 implementation backed by parquet2 feature gate

Open houqp opened this issue 2 years ago • 5 comments

decouple core from arrow

Description

WIP parquet2 implementation. The goal of this PR is to implement full read support using parquet2. Write support is out of scope and should be added in follow-up PRs. Arrow2 integration is also out of scope and should be added in a follow-up PR.

Currently all read tests are passing:

cargo test --no-default-features --features=arrow2,parquet2 

Todo:

  • [ ] clean up duplicated code
  • [ ] support parsing map type
  • [ ] support parsing list type
  • [ ] benchmark

Related Issue(s)

blocks #310

Documentation

houqp avatar Oct 17 '21 23:10 houqp

@ritchie46 let me know what you think about this approach. The core library is now fully decoupled from the arrow-rs crate and only depends on parquet2 for checkpoint parsing.

As a consumer, e.g. polars, you should be able to use it with --no-default-features --features=parquet2. The arrow integration only provides schema conversion between the delta table schema and the arrow schema, which is not very useful for polars. You might be better off just using the schema from the raw parquet file for now, see https://github.com/delta-io/delta-rs/issues/441.
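To illustrate what feature gating a dependency like this looks like in Rust, here is a minimal sketch. It is not code from this branch: the `CheckpointBackend` enum and `checkpoint_backend` function are hypothetical, and the `parquet` feature name is an assumption (only `arrow2` and `parquet2` appear in this thread). Exactly one of the three functions is compiled, depending on which features are enabled.

```rust
// Hypothetical sketch of compile-time backend selection via Cargo features.
// None of these names are delta-rs APIs; the "parquet" feature is assumed.

/// Which parquet implementation the crate was compiled against.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CheckpointBackend {
    ArrowRs,
    Parquet2,
    None,
}

/// Compiled only when built with `--features=parquet2`.
#[cfg(feature = "parquet2")]
pub fn checkpoint_backend() -> CheckpointBackend {
    CheckpointBackend::Parquet2
}

/// Compiled when the (assumed) arrow-rs backed `parquet` feature is on
/// and `parquet2` is off.
#[cfg(all(feature = "parquet", not(feature = "parquet2")))]
pub fn checkpoint_backend() -> CheckpointBackend {
    CheckpointBackend::ArrowRs
}

/// Fallback when neither feature is enabled.
#[cfg(not(any(feature = "parquet", feature = "parquet2")))]
pub fn checkpoint_backend() -> CheckpointBackend {
    CheckpointBackend::None
}
```

Because the selection happens at compile time, a consumer that builds with only `parquet2` never compiles or links the other backend.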

If you are ok with this design, we can collaborate on the qp_arrow2 branch to finish up the PoC.

houqp avatar Oct 17 '21 23:10 houqp

If you are ok with this design, we can collaborate on the qp_arrow2 branch to finish up the PoC.

I don't understand this library well enough yet to fully judge this. But if anything comes up during the polars integration, I hope I can make suggestions. As I said, kudos for being able to feature gate such a core dependency.

ritchie46 avatar Oct 18 '21 07:10 ritchie46

Sounds good @ritchie46, I will complete the parquet parsing support for map and list this weekend. But the branch I have here right now should be enough to unblock polars integration.

houqp avatar Oct 19 '21 05:10 houqp

@houqp, any updates on this?

andrei-ionescu avatar Jul 01 '22 16:07 andrei-ionescu

@andrei-ionescu I have implemented all the data types other than map and nested list, so it's very close to being complete. However, my time is limited now, so progress will be slow. Anyone is welcome to collaborate on this branch to push this over the finish line :)

houqp avatar Jul 17 '22 23:07 houqp

Alright, this branch is now feature complete ;) Now it's time to catch up to the latest delta-rs main branch and arrow2/parquet2 releases.

houqp avatar Aug 22 '22 01:08 houqp

@wjones127 @roeap ready for review.

houqp avatar Aug 29 '22 01:08 houqp

maybe it would be cleaner to have the parquet specific stuff also moved into its own mod

Good idea, I have moved all that code into a parquet_read mod to keep action lean.

I chatted about this with Andy and Jorge at the Data+AI summit a couple of months ago. The long-term goal is to develop an Arrow trait that allows users to switch between arrow-rs and arrow2 in different projects, including datafusion and delta-rs. This will also open up the possibility of a third, GPU-based arrow implementation ;)
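As a rough illustration of the kind of trait described above, here is a minimal sketch. No such trait is defined in this thread, so everything here is hypothetical: `ArrowBackend`, the two stand-in backends `ArrowRsLike` and `Arrow2Like`, and their toy schema types do not correspond to real arrow-rs or arrow2 APIs. The point is only that generic code can be written once against the trait and instantiated with either backend.

```rust
// Hypothetical sketch: a trait that hides which Arrow implementation is used.
// The backends below are toy stand-ins, not arrow-rs or arrow2 types.

pub trait ArrowBackend {
    /// Backend-specific schema representation.
    type Schema;

    fn name() -> &'static str;
    fn schema_from_fields(fields: &[(&str, &str)]) -> Self::Schema;
    fn data_type(schema: &Self::Schema, column: &str) -> Option<String>;
}

/// Stand-in backend storing the schema as (name, type) pairs.
pub struct ArrowRsLike;
pub struct PairSchema(Vec<(String, String)>);

impl ArrowBackend for ArrowRsLike {
    type Schema = PairSchema;

    fn name() -> &'static str {
        "arrow-rs-like"
    }

    fn schema_from_fields(fields: &[(&str, &str)]) -> PairSchema {
        PairSchema(
            fields
                .iter()
                .map(|(n, t)| (n.to_string(), t.to_string()))
                .collect(),
        )
    }

    fn data_type(schema: &PairSchema, column: &str) -> Option<String> {
        schema
            .0
            .iter()
            .find(|(n, _)| n.as_str() == column)
            .map(|(_, t)| t.clone())
    }
}

/// Stand-in backend storing names and types in parallel vectors.
pub struct Arrow2Like;
pub struct ParallelSchema {
    names: Vec<String>,
    types: Vec<String>,
}

impl ArrowBackend for Arrow2Like {
    type Schema = ParallelSchema;

    fn name() -> &'static str {
        "arrow2-like"
    }

    fn schema_from_fields(fields: &[(&str, &str)]) -> ParallelSchema {
        ParallelSchema {
            names: fields.iter().map(|(n, _)| n.to_string()).collect(),
            types: fields.iter().map(|(_, t)| t.to_string()).collect(),
        }
    }

    fn data_type(schema: &ParallelSchema, column: &str) -> Option<String> {
        schema
            .names
            .iter()
            .position(|n| n.as_str() == column)
            .map(|i| schema.types[i].clone())
    }
}

/// Generic code written once against the trait, usable with any backend.
pub fn describe_column<B: ArrowBackend>(fields: &[(&str, &str)], column: &str) -> String {
    let schema = B::schema_from_fields(fields);
    match B::data_type(&schema, column) {
        Some(t) => format!("{}: {} is {}", B::name(), column, t),
        None => format!("{}: {} not found", B::name(), column),
    }
}
```

A caller picks the backend with a type parameter, for example `describe_column::<Arrow2Like>(&[("id", "long")], "id")`. A third backend would only need another `impl ArrowBackend`.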

houqp avatar Aug 30 '22 05:08 houqp