datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Updating arrow2 branch

Open dbr opened this issue 2 years ago • 21 comments

The arrow2 branch is currently using v0.10.0 - I'd like to update it to work with v0.12.0. Is there anything I should know before starting this?

I ask as last time I made a PR for the branch, there was an impending rework of that branch, and I'm not sure what the state of things are currently!

dbr avatar Jun 08 '22 09:06 dbr

Would it make sense to consider having a separate, optional, datafusion-arrow2 crate containing a physical planner + operators that use arrow2? It seems to me that this would be easier to maintain. Maybe @houqp has an opinion on this?

andygrove avatar Jun 08 '22 13:06 andygrove

cc @Igosuki as well who may have some thoughts

alamb avatar Jun 08 '22 18:06 alamb

Hello, sorry for my absence. Maintaining full compatibility requires constant updates, and updating the arrow2 branch is often in the hundred files because of the API surface conflict. So adding a new crate where only arrow2 itself needs to be updated is a good idea in my opinion.

Igosuki avatar Jun 12 '22 07:06 Igosuki

+1 on @andygrove 's idea for isolating changes to a self-contained crate. @dbr if you are interested in that, please feel free to start that project at the datafusion contrib github org. If you are only interested in updating arrow2 version in the current arrow2 branch, then you can just send a PR to the arrow2 branch :)

houqp avatar Jun 13 '22 05:06 houqp

Should we have a discussion/vote about whether this should be in datafusion-contrib rather than under Arrow governance? There are the options of creating a new repo such as arrow-datafusion-arrow2 or a new optional crate within the main repo. The work on this so far has been under Arrow governance so we should make sure that all contributors are in agreement on moving it out and also be aware that moving it back in is a lot of work. In hindsight, I wish we had not moved the DataFusion Python bindings out of Arrow governance after all the work that was done to get them donated in the first place.

andygrove avatar Jun 16 '22 14:06 andygrove

@alamb @jorgecarleitao Do you have opinions on this?

andygrove avatar Jun 16 '22 14:06 andygrove

I do not have any preference, but I also don't really spend any effort on arrow2 or the arrow2 branch in datafusion -- perhaps @Igosuki or others who have expended significant effort would be better to weigh on

alamb avatar Jun 16 '22 14:06 alamb

@andygrove, thanks for the ping! I do not have a preference - I am very happy to support anyone that wishes to use arrow2/parquet2, and I am very thankful for everyone that has been maintaining the arrow2 branch. I unfortunately do not have the time left to spearhead it.

@dbr I would be very happy to support you in the migration - let me know how I can help and feel free to assign tasks directly. Arrow2's core API is essentially stable and that we we usually have small breaking changes in IO, so most of the changes should be on the datafusion's side.

For full transparency: imo the primary question here is one of control - although we demonstrated that arrow2 had superior support, safety, performance, documentation, and UX, I sensed that the main requirement was that arrow2 was donated to Apache, which for me was unacceptable. This demotivated me from contributing here.

I understand that companies that have a strong dependency on DataFusion need control over its core dependency. For me it demos that we are a bit past the idea that people in Apache are individual contributors.

The core development of Polars and Databend have been more pragmatic and I have thus been more aware of the requests there. However, if there is anything needed, just drop an issue and I will gladly take it, irrespectively of where the development happens (here or in a separate repo / org).

I continue to believe that DataFusion would benefit immensely from arrow2's current and future capabilities - arrow2 is the backend of likely the fastest in-node dataframe APIs out there and just this week we exposed a new API that enable compute to be 2x-5x faster ;)

jorgecarleitao avatar Jun 16 '22 19:06 jorgecarleitao

although we demonstrated that arrow2 had superior support, safety, performance, documentation, and UX

I believe there are differing opinions on some of these items

I do think there is broad agreement that the governance model of arrow2 (benign dictator) is not suitable for all users for a variety of reasons.

alamb avatar Jun 16 '22 20:06 alamb

The main issue a few months ago was that the datafusion codebase was not split like it is now, so it created conflicts in the various places where it wasn't possible to confine differences between arrow and arrow2::io to extension traits. So my question is, can a contrib crate exist since arrow2 is essentially required at the core ?

Igosuki avatar Jun 17 '22 08:06 Igosuki

I do think there is broad agreement that the governance model of arrow2 (benign dictator) is not suitable for all users for a variety of reasons.

I agree. Although there is some great work happening in arrow2, I am not motivated (or even permitted) to contribute there. I am heavily invested in working with the Apache Arrow community to continue to improve things here together. The Apache governance model is not perfect but it is working well enough from my point of view. I am a big believer in community over code.

andygrove avatar Jun 17 '22 14:06 andygrove

The main issue a few months ago was that the datafusion codebase was not split like it is now, so it created conflicts in the various places where it wasn't possible to confine differences between arrow and arrow2::io to extension traits. So my question is, can a contrib crate exist since arrow2 is essentially required at the core ?

Regardless of where the code lives, I would view us as having two main options:

  1. Continue to maintain a copy of DataFusion "core" with modifications to allow building with arrow or arrow2. This will be an ongoing maintenance burden and IMO only worth the effort if we are working towards having this merged into DataFusion master at some point and I am not sure how much appetite there is for this.

  2. Fork datafusion-core to create a new query engine based on arrow2 that can leverage the new DataFusion crates for logical plan building, SQL query planning, logical plan optimizations etc. Over time maybe more functionality can be moved out of core and re-used by other query engines. This would decouple the development work and allow those interested in arrow2 to move faster and does not put any burden on the core DataFusiuon development. That would be my choice if I were working on this.

andygrove avatar Jun 17 '22 14:06 andygrove

One approach that @Jimexist started with, that I thought was promising was to use traits to collect the differences in the two apis -- so that DataFusion used the traits and then there were impls for both arrow and arrow2

For example, from_slice in https://github.com/apache/arrow-datafusion/pull/1588 and https://github.com/apache/arrow-datafusion/pull/1589

I would be quite supportive of such a strategy and would be a nice way to let people use which every library they wanted

alamb avatar Jun 18 '22 11:06 alamb

I would be quite supportive of such a strategy and would be a nice way to let people use which every library they wanted

Bonus points if it could one day support a GPU implementation of Arrow as well 😉

andygrove avatar Jun 19 '22 01:06 andygrove

so that DataFusion used the traits and then there were impls for both arrow and arrow2

This indirection might also help to decouple downstream's arrow version with DF's.

waynexia avatar Jun 21 '22 07:06 waynexia

The current arrow2 branch was not created with the intention to keep it as a long running branch, instead, the original plan was to use it as a tool to validate (https://github.com/apache/arrow-datafusion/issues/1652) whether we will be able to get a significant performance boost by switching to arrow2, and if so, convince the rest of the community to make that jump :)

Unfortunately, due to personal reasons, I don't have the time to keep pushing this forward anymore.

I agree with everyone that coming up with a zero cost arrow trait abstraction would be the best path forward, especially considering the possibility of adding arrow-gpu in the future!

houqp avatar Jun 27 '22 05:06 houqp

@alamb AFAIK governance has not been a factor when considering dependencies in DataFusion, but maybe I missed it when that decision taken. Specifically, going through some of its dependencies:

  • sqlparser - unclear governance
  • ahash - unclear governance
  • tokio - Tokio team; could not find governance doc
  • prost - Tokio team; could not find governance doc
  • thrift - clear governance, Apache Arrow
  • parquet-format - unclear governance
  • rayon - unclear governance
  • rand - unclear governance
  • parking_lot - unclear governance
  • smallvec - unclear governance
  • cranelift - clear governance, ByteCodeAlliance
  • avro-rs - unclear governance / donated to Apache but no release yet.
  • parquet - clear governance, Apache Arrow
  • chrono - unclear governance
  • async-trait - unclear governance
  • (under PR) object_store - unclear governance

"arrow2 - unclear governance" is imo a more adequate characterization here - I have been contributing the most to it, so it is natural that I have been taking most of the decisions. Ritchie from Polars, @Dandandan, @alamb and others have full write permissions to the repo; @houqp and Ritchie have permissions to publish to crates.io.

I would be happy to set/modify a governance to arrow2 to accommodate this community's needs, but I would need to know what is the exact requirement, because being part of Apache does not seem to be one today.

Wrt to community over code, I certainly agree with you, @andygrove - I just believe that the Apache Foundation does not own the monopoly of that practice - there are many projects and organizations that fulfill that without forcing projects to a structure like Apache forces, in particular how contributions to some part of the project induce control over the whole project (being PMC).

I agree that going through a trait abstraction is a way. Creating an "arrow2arrow" as suggested in https://github.com/jorgecarleitao/arrow2/issues/629 would also be a way. I would be willing to create and maintain such a crate if that would help this community, but again, we would end up with the same question - it would not be Apache governed, can it be a dependency?

jorgecarleitao avatar Jun 27 '22 13:06 jorgecarleitao

AFAIK governance has not been a factor when considering dependencies in DataFusion

To conflate arrow which is core to both the in-memory layout and query computation, with something fairly self-contained like smallvec, chrono or ahash I think is a wee bit disingenuous :sweat_smile:

It is fairly common for work in arrow-datafusion to build upon/require work in arrow-rs, whereas the same is not true of those listed dependencies. Given we are still talking about this I think we have clearly demonstrated arrow is not a trivial dependency, and therefore warrants some additional care and attention...

was to use traits to collect the differences in the two apis -- so that DataFusion used the traits and then there were impls for both arrow and arrow2

If practicable this sounds like a good idea to me. It would also fit with my ongoing efforts, articulated here, to gradually bring incubated ideas from arrow2 to arrow-rs. I'm actually pretty happy with the progress that has been made so far, and having an easy way to compare the two implementations I think would prove insightful.

In particular I have been inspired by the way the arrow-rs community has worked together to drive progress, through code contributions or otherwise. Yes arrow-rs still has many warts, and is often not particularly pleasant to work on, but there is an active community of people working to make it better in whatever ways they can, and that to me is what open source is all about. For some the governance question is one of legal practicalities, but at least for me it is about having a friendly community working together to deliver great software.

tustvold avatar Jun 27 '22 21:06 tustvold

We have proposed donating object_store to Arrow on https://github.com/influxdata/object_store_rs/issues/41

alamb avatar Jun 28 '22 18:06 alamb

I have filed an issue for adding GPU support: https://github.com/apache/arrow-rs/issues/1966

The proposed arrow-gpu crate differs from arrow and arrow2 because it is necessary to transfer the array data to the GPU before invoking GPU kernels and then the results can be transferred back. It does not make sense to support accessing individual elements in arrays stored on the GPU. However, in other ways they look similar to arrays on the host in that they have data types and you can query the length and whether they have nulls.

As we think about abstractions it would be good to take this into account.

andygrove avatar Jun 29 '22 14:06 andygrove

Lets move back to upgrading arrow2 version in arrow2 branch. I'm quite interested in that, but I found that some unit tests are currently failing:

  • datasource::file_format::json::tests::read_limit
  • datasource::file_format::json::tests::read_small_batches
  • execution::dataframe_impl::tests::register_table
  • optimizer::simplify_expressions::tests::cast_expr_wrong_arg
  • optimizer::simplify_expressions::tests::multiple_now_expr
  • optimizer::simplify_expressions::tests::now_less_than_timestamp
  • optimizer::simplify_expressions::tests::test_const_evaluator_now
  • physical_plan::file_format::json::tests::nd_json_exec_file_projection
  • physical_plan::file_format::json::tests::nd_json_exec_file_with_missing_column
  • physical_plan::file_format::json::tests::nd_json_exec_file_without_projection
  • physical_plan::file_format::parquet::tests::parquet_exec_with_error
  • physical_plan::file_format::parquet::tests::write_parquet_results
  • physical_plan::planner::tests::bad_extension_planner
  • physical_plan::sorts::sort::tests::test_sort_spill

I'd like to first upgrade arrow2 version and then fix these failing tests. Also, upgrading arrow2 to latest version v0.12 means upgrading some dependencies like tonic/parquet-format-async-temp etc, still don't how much work to do. I will open a PR as soon as I finish.

v0y4g3r avatar Jul 07 '22 15:07 v0y4g3r

Given the current discussion about arrow/arrow2 unification (see https://github.com/apache/arrow-rs/issues/1176#issuecomment-1430883886) I think this issue may no longer be relevant.

If no one objects, I'll plan to close it in the next few days

alamb avatar Mar 02 '23 17:03 alamb

Hearing no other comments here, I am closing this one. Feel free to chime in if you disagree or have other comments to add

alamb avatar Mar 06 '23 19:03 alamb