arrow2 [question] Questions regarding project aim

[question] Questions regarding project aim

Open ritchie46 opened this issue 3 years ago • 9 comments

Not gone through the source yet, so I hope I don't ask dumb things.

This is very cool and probably takes a way a lot of the gripes in the current implementation. I also think this can help compile times quite a lot.

As the memory model of arrow is stable. Do you think it would be possible to define traits for Arrow2 arrays and default arrow Arrays that make it possible to generalize the existing kernels between both crates? Or is that a ball and chain you don't want to have for this repo (which I can imagine).

And do you aim publishing to crates.io? :)

Mar 29 '21 17:03 ritchie46

Thank you for your interest: great admirer of your work in Polars!

Also great questions and ideas. Regarding traits: technically, it should be possible. I think that we we would need to declare a fair bit of them; for example arrow Bitmaps are not equal to arrow2 bitmaps, and thus we would need a trait for them (e.g. when we read validities from arrays). Buffers are also different: in arrow every buffer is Buffer<u8>, but in arrow2 buffers are typed. Thus, a trait that returns a "buffer" is not easy to achieve (which is required when a kernel clones a buffer, e.g. casting).

We would also need to declare them somewhere: on arrow or arrow2 or a third repo, arrow-traits, and then implement them on both sides. If we place them in the arrow repo, this crate now depends on a large dependency (arrow). The traits' release cycle is also tied to arrow release cycle. If we place it here, then arrow would need to depend on arrow2. If we place it on a separate repo, who manages and takes decisions over it?

My wish is that we would move this to the Apache Arrow project to an separate repo and release cycle independent of everything else (like a normal crate).

Note that this crate's high-end API is very similar to arrow, thus, in terms of migration, it should not be a major issue. I know exactly what to do to migrate DataFusion to it. If you think it is worth, I would gladly take a shot at working it out in Polars.

My plan was to have parquet support before doing any work on integration as IMO parquet IO is a major feature missing here and any work would be blocked by that.

It happens that parquet has its fair share of non-trivial transmutations; I am currently evaluating whether there is a way out of that. I do have an experimental branch that supports primitive types here using parquet crate, but I would prefer to avoid depending on a crate which I sense may have some risk of buffer overflows and/or out of bound reads (which IMO are direct CVE material).

I could also finish the parquet implementation and deal with the safety and performance later. Any suggestions?

And do you aim publishing to crates.io? :)

Nothing stopping us. My idea was to use pre-1 semver 2.0 here, 0.X.Y where X represents backward incompatibility and Y is minor and patch (backward compatible).

Mar 29 '21 18:03 jorgecarleitao

Thank you for your interest: great admirer of your work in Polars!

Thank you, I think I can say the same because I learned a lot from the Arrow project. :smile:

My wish is that we would move this to the Apache Arrow project to an separate repo and release cycle independent of everything else (like a normal crate).

That would be great indeed. Another gripe is the slow release cycle, and no patches.

Regarding the traits.. Yeah I fully understand the complexities there. Maybe its better to stay free from those complexities. Currently you have a clean slate.

Note that this crate's high-end API is very similar to arrow, thus, in terms of migration, it should not be a major issue. I know exactly what to do to migrate DataFusion to it. If you think it is worth, I would gladly take a shot at working it out in Polars.

I would be very thankful for any advice or help making Polars ready for such a change. I believe that the complexity is in ChunkedArray<T> where T can be a PolarsPrimitive: ArrowPrimitive. So there is a dependency there.

I believe the separation of the logical part and physical part truly is worth the refactor. Currently, if I compile for all Series, compilation take very long due to all the duplicates in generics, e.g. Date32 being physically the same as i32.

whilst I am typing this I come to the conclusion that the beforementioned dependency should not matter, and could be abstracted away. Still I could learn from you separation of physical, / logical types!

I could also finish the parquet implementation and deal with the safety and performance later. Any suggestions?

Make it work, make it right, make it fast? :smile:

Mar 30 '21 14:03 ritchie46

@jorgecarleitao I thought the overall aim of arrow2 was to upstream it to the main arrow repo. Don't you think that publishing this crate and having people rely on it will create more confusion than anything and risk splitting the community?

The release process was discussed a few times on the mailing list and don't think there is anything stopping us (Rust) releasing more frequently, although in practice it's cumbersome.

I know you were wanting to stop using JIRA, etc. in Arrow and I agree that they do impose some hardship, I was hoping your proposal to move to GitHub would be accepted but overall I think keeping the community intact is more important.

Mar 30 '21 19:03 paddyhoran

@ritchie46 , perfect.

I would be very thankful for any advice or help making Polars ready for such a change. I believe that the complexity is in ChunkedArray<T> where T can be a PolarsPrimitive: ArrowPrimitive. So there is a dependency there.

So, the original arrow crate has a bunch of traits Native, ArrowPrimitive, ArrowNumeric, etc. They represent a combination of Logical and physical types (the const DATA_TYPE is the logical, its in-memory representation the physical) and other numeric that are not really needed.

Arrow2 has one type, NativeType, for primitive arrays. It represents any byte-like type that can be represented in Rust (and thus can be interpreted from bytes). All native Rust types that are part of the Arrow spec implement NativeType.

The consequence is that matches over DataType are always done to differentiate:

physical representations
physical operations

as either of them inevitably generate a different assembly / set of instructions.

In practice, this amounts to

match array.data_type() {
     DataType::Boolean => array.as_any().downcast_ref::<Boolean>().unwrap() ...
     DataType::Date32 | Time32(_) => array.as_any().downcast_ref::<PrimitiveArray<i32>>().unwrap(); op specific for any date32 or time32
     DataType::Date64 => array.as_any().downcast_ref::<PrimitiveArray<i64>>().unwrap(); op specific for date64
     DataType::Timestamp(_, _) => array.as_any().downcast_ref::<PrimitiveArray<i64>>().unwrap(); Op that applies to all time units
}

I.e. no more "TimestampMillisecondArray" and the like; they are all PrimitiveArray<i64> (since their physical representation is the same and thus the assembly code is the same, their only differ in the DataType). There is a map between a logical type and a physical type, which I placed as part of the guide.

Make it work, make it right, make it fast? 😄

Roger roger :)

I was able to re-write all functionality without using unsafe and AFAIK without jeopardizing performance, which gives me confidence that things can be indeed done in a safe manner. I can already read to native rust (Vec<Option<i32>>). I am now writing the logic to map parquet's representation directly into arrow.

I have not started writing the parquet writer. Is this something already supported by Polars? If yes, then it is something I also need to work, to ensure feature parity.

Apr 03 '21 06:04 jorgecarleitao

@paddyhoran , thank you for your comment and points. I understand that sentiment, which, perhaps paradoxically, I also share.

The release process was discussed a few times on the mailing list and don't think there is anything stopping us (Rust) releasing more frequently, although in practice it's cumbersome.

Patch releases are still tied to all implementations. We still release a major version every X months with all others, release all cargos at once, have no minor releases and a bunch of other practical problems that I raised on the mailing list.

Don't you think that publishing this crate and having people rely on it will create more confusion than anything and risk splitting the community?

but overall I think keeping the community intact is more important.

With +50k -40k code committed to the arrow repo, the Rust implementation of arrow and DataFusion on the main repo is something very dear to me. It has been my personal project over the past 8 months. Not wanting to merge this upstream for now was not a decision that I took lightly; it generated a fair amount of stress in my life over the past 2 months.

Arguing / pressuring that I should not release a Rust implementation of the Arrow specification outside of the Apache Arrow project at the expense of splitting the community is imo a bit unrealistic in an open source context. For me, it would be like asking @ritchie46 to donate Polars to Arrow because it otherwise splits the community around DataFusion and Polars.

IMO the community remains intact for as long as we respect and help each other, which imo has more to do with whether we listen and take our members' opinions into account. Where the code lives is typically a by-product of such behaviors. The fact that I am doing this should be interpreted as a consequence, not cause, of some of my frustrations, which I hope I was sufficiently transparent about in the mailing list.

If other projects are considering depending on this crate, IMO we should offer them the right tools (SemVer and a package manager). I am supportive of this because I know that the arrow crate has a serious soundness and security design problem that will only be addressed with a major redesign, and IMO it is more important to have projects in Rust relying on safe and sound implementations with good development and release processes than to have this code under the Apache Arrow repo. Others may of course disagree with the weights on the equation, and are welcome to address those security concerns within the arrow repo.

So, like I said,

My wish is that we would move this to the Apache Arrow project to an separate repo and release cycle independent of everything else (like a normal crate).

However, given the responses I got in the mailing list over these concerns, it is not something in my top list of priorities atm. I am more concerned in writing an implementation for reading and writing parquet that is safe and fast so that arrow2 has the same feature parity as arrow with a design that leverages Rust's compiler for checking memory-management invariants.

Apr 03 '21 07:04 jorgecarleitao

Arguing / pressuring that I should not release a Rust implementation of the Arrow specification outside of the Apache Arrow project at the expense of splitting the community is imo a bit unrealistic in an open source context. For me, it would be like asking @ritchie46 to donate Polars to Arrow because it otherwise splits the community around DataFusion and Polars.

I'm not trying to pressure you into anything. It's unfortunate that an Arrow PMC is working on an alternative implementation outside the main community (as defined within Apache) though. I disagree with your comparison to Polars but I do appreciate and understand the point to you trying to make.

and IMO it is more important to have projects in Rust relying on safe and sound implementations with good development and release processes than to have this code under the Apache Arrow repo

You are mentioning "safe and sound implementations" but there is no disagreement here. I don't think anyone in the community is married to the current implementation (though I probably should not speak for @andygrove, @alamb, @nevi-me, etc.) and all of us would trust you to refactor the current implementation. I believe your main issues are with JIRA and the release process. Again, I'm in agreement with you here, I just don't feel as strongly as you do :).

In short, I understand your reasons but I hope that you are able to upstream this implementation to the main arrow repo (I really like some of the changes you have made).

Apr 06 '21 20:04 paddyhoran

Thanks for the ping @paddyhoran and I'm excited to see some of the amazing work that @jorgecarleitao is doing here and I would very much like to see this upstreamed to the official Apache Arrow repo. I would love to be able to use this from DataFusion and Ballista. I think it could be game-changing in terms of performance and safety.

I'm nearly through the process of contributing Ballista back to Arrow and it has been a slightly painful/stressful process so if there is a plan to contribute this back then I would advise:

Make sure there are CI checks for ASL (not ASF) headers in the project
Make sure that all contributors are onboard with contributing this back and prepared to sign CLAs when the time comes (and maybe consider only accepting future contributions from those that already have Apache CLAs filed)

Given that I myself have gone off and developed something new in a separate repo, I obviously understand the appeal and the need to do this for major undertakings. You can't do this kind of project one PR at a time.

On the subjects of JIRA, release process, release versioning, monorepo, etc, I also would like to see changes here, and although we have made some progress (like voting on source releases for patch releases) I acknowledge there is more to do. I personally have not been able to find the time to help drive this forward so far but once Ballista is merged I will have more motivation to work on this.

Apr 06 '21 21:04 andygrove

I for one would be very supportive of getting some/all of this implementation back into the main arrow/rust implementation. I realize doing so is lots of work -- not just technically but also JIRA / process wise.

I don't especially enjoy doing the JIRA dance for the main apache repo, but my judgement is that the apache/arrow crate is the best, most full featured implementation of Arrow in Rust, and it has the biggest community and potential to attract new contributors, and so it is where I want to spend my time. As @andygrove said, projects at different phases of their lives are best served by processes that tailored.

I truly hope some / all of the great innovation @jorgecarleitao is doing in this repo can be used and appreciated by a wider community: I think it will be under appreciated / under used if it stays as a separate implementation.

Anyhow, let me know what I can do to help. There is so much great work going on.

Apr 07 '21 00:04 alamb

I.e. no more "TimestampMillisecondArray" and the like; they are all PrimitiveArray (since their physical representation is the same and thus the assembly code is the same, their only differ in the DataType). There is a map between a logical type and a physical type, which I placed as part of the guide.

Yes, I've read it, and it make total sense. Another benefit is the reduction of code bloat. The cast kernel is huge!

I have not started writing the parquet writer. Is this something already supported by Polars? If yes, then it is something I also need to work, to ensure feature parity.

Yes, it is, but I see you already are rocking that as well.. :smile:

With regard to the second discussion that has started on this thread. I believe that as long we are staying compatible with the arrow memory format, a crate that does semantic versioning, regular patch releases etc., is beneficial to the arrow community and acceptance.

At the moment, (unless you are in the main repo), being dependent on Arrow's official released Rust sources (e.g. crates.io) is quite hard. For Polars I started following github master branch, because I was afraid I couldn't get the refactors working otherwise. I believe that regular updated crates, with semantic versioning can make the whole Rust Arrow community bigger, instead of splitting it.

Apr 10 '21 10:04 ritchie46

arrow2 arrow2 copied to clipboard

[question] Questions regarding project aim

arrow2
arrow2 copied to clipboard