datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Bump arrow2

Open v0y4g3r opened this issue 2 years ago • 6 comments

Which issue does this PR close?

Closes #2709.

Rationale for this change

Current branch arrow2 is still using arrow2 version v0.10 and falls far behind the latest version v0.12

What changes are included in this PR?

Dependencies upgraded:

  • arrow2: v0.10 -> v0.12
  • parquet2: v0.12 -> v0.13
  • arrow-format: 0.4->0.6
  • prost: 0.9 -> 0.10
  • prost-types: 0.9 -> 0.10
  • tonic: 0.6 -> 0.7
  • topic-build: 0.6 -> 0.7

API change:

  • arrow2::error::ArrowError -> arrow2::error::Error
  • arrow2 FileWriter now accepts Vec<Vec<Encoding>> instead of Vec<Encoding> and transverse can be used to create encodings from schema with a customized mapping.
  • arrow2 RowGroupMetadata no longer as a column(usize) method to fetch column metadata at given index, instead, it provides a columns() method that returns a column slice.
  • datafusion::dataframe::DataFrame::write_parquet method now accepts arrow::io::parquet::write::WriteOptions instead of parquet::write::WriteOptions
  • some other minor changes.

Are there any user-facing changes?

Yes, please refer to previous section for full API change list. But I think it's rather easy for users of arrow2 branch to adapt to these changes.

About tests

Some tests are failling in current arrow2 branch, so I'm not going to fix all failing tests in this PR, instead I'll make sure no more unit test fails. I'd be happy to fix all failing tests after this PR is merged so I don't have to fix them twice :)

v0y4g3r avatar Jul 08 '22 06:07 v0y4g3r

@v0y4g3r I'd like to start understanding more about the efforts here. Is the plan to put this behind a cargo feature so that we can switch between arrow and arrow2?

andygrove avatar Jul 08 '22 16:07 andygrove

@v0y4g3r I'd like to start understanding more about the efforts here. Is the plan to put this behind a cargo feature so that we can switch between arrow and arrow2?

No, it's simply about upgrading dependencies.

But I'm also interested in something like defining a trait to hide the API differences between arrow and arrow2 and use cargo feature to provide an option. But I have no ideas on how to implement that now.

There are some license headers missing:

NOT APPROVED: ballista/rust/core/src/memory_stream.rs (./ballista/rust/core/src/memory_stream.rs): false
NOT APPROVED: datafusion/src/physical_plan/sort.rs (./datafusion/src/physical_plan/sort.rs): false
NOT APPROVED: datafusion/tests/sql_integration.rs (./datafusion/tests/sql_integration.rs): false
NOT APPROVED: datafusion-physical-expr/src/test_util.rs (./datafusion-physical-expr/src/test_util.rs): false

Seems like current arrow2 branch doesn't have license header. But it's ok I added headers.

v0y4g3r avatar Jul 09 '22 03:07 v0y4g3r

@andygrove Wondering what I still need to do in order to get this merged and I'd be happy to follow up.

v0y4g3r avatar Jul 11 '22 03:07 v0y4g3r

@andygrove Seems like some workflows need a maintainer's approval to run.

v0y4g3r avatar Jul 14 '22 02:07 v0y4g3r

@v0y4g3r looks like there are some minor build issues that need to fixed.

houqp avatar Jul 18 '22 03:07 houqp

Marking as draft to note it is not ready for merging -- please change it back to ready for review when it is ready. Thanks!

alamb avatar Aug 09 '22 15:08 alamb

@v0y4g3r looks like there are some minor build issues that need to fixed.

These tests are failing in current arrow2 branch, and is not introduced by this PR.

Anyway, I'm working on fixing these tests in another pull request, but there's something different between arrow and arrow2.

v0y4g3r avatar Sep 26 '22 06:09 v0y4g3r

Marking this PR as a draft as I am cleaning up the review queue. Please mark it as ready for review when it is

alamb avatar Oct 12 '22 18:10 alamb

This PR is more than 6 month old, so closing it down for now to clean up the PR list. Please reopen if this is a mistake and you plan to work on it more

alamb avatar Jan 14 '23 11:01 alamb