datafusion-comet
datafusion-comet copied to clipboard
build: Switch back to released version of DataFusion and arrow-rs after Arrow Java 16 is released
Which issue does this PR close?
Closes #248.
Rationale for this change
What changes are included in this PR?
How are these changes tested?
Note that this is blocked by #250. We need to verify Java Arrow after #250 is merged.
The fix (https://github.com/apache/datafusion/pull/10094) is not released yet. So the following error still happens in CI:
CometAggregateSuite:
- count with aggregation filter (371 milliseconds)
- lead/lag should return the default value if the offset row does not exist *** FAILED *** (133 milliseconds)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 32.0 failed 1 times, most recent failure: Lost task 0.0 in stage 32.0 (TID 45) (74d216948a65 executor driver): org.apache.comet.CometNativeException: Arrow error: Invalid argument error: must either specify a row count or at least one column
Need to wait for next release.
Besides, the Java Arrow bug is supported to be fixed in Arrow Java 16. But I still see it happens in a few pipelines:
- multiple column distinct count *** FAILED *** (392 milliseconds)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 59.0 failed 1 times, most recent failure: Lost task 5.0 in stage 59.0 (TID 117) (74d216948a65 executor driver): org.apache.comet.CometNativeException: Fail to process Arrow array with reason C Data interface error: The external buffer at position 2 is null..
And more interesting is, the test doesn't always have the failure. I also cannot reproduce it locally. Wondering if there are more than one Java Arrow jar is included to randomly it picks up an older version. 🤔
BTW, it also only failed on ubuntu pipelines, mac os-14 pipelines are all okay without the C Data interface error.
Hmm, the error is actually different:
Cause: org.apache.comet.CometNativeException: Fail to process Arrow array with reason C Data interface error: The external buffer at position 2 is null..
Previously, the fixed one is the buffer at position 1 (offset buffer):
General execution error with reason org.apache.comet.CometNativeException: Fail to process Arrow array with reason C Data interface error: The external buffer at position 1 is null...
I need to look at what the buffer is at position 2 and why Java Arrow passes a null buffer again...
Ah, I found that I made a mistake in the Java Arrow PR that it doesn't initiate the offset buffer well. Proposed another issue at Java Arrow https://github.com/apache/arrow/issues/41609
I found it by trying to dump the buffer value:
- multiple column distinct count *** FAILED *** (391 milliseconds)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 71.0 failed 1 times, most recent failure: Lost task 5.0 in stage 71.0 (TID 125) (ed8fe556741f executor driver): org.apache.comet.CometNativeException: Fail to process Arrow array with reason C Data interface error: The external buffer with len = 18446744071618653625 at position 2 is null..
I fixed the issue at arrow-rs in the PR https://github.com/apache/arrow-rs/issues/5756.
We can continue this PR after the fix is released. Keep this as draft for now.
I have verified the fix https://github.com/apache/arrow-rs/issues/5756 can work by cherry-picking it in my forked branch of arrow-rs.
We only need to wait for new release of arrow-rs and DataFusion.
We only need to wait for new release of arrow-rs and DataFusion.
Do we have an estimate on when new releases of arrow-rs and DataFusion will be available? I'm asking because I'm wondering whether to include this in the first release of Comet. It seems correct to use the released version of arrow-rs and DataFusion in the official release of Comet.
arrow-rs and DataFusion have fast release cycle. We can hold Comet release after the new releases of arrow-rs and DataFusion.
We can hold Comet release after the new releases of arrow-rs and DataFusion.
Thanks for the clarification. I second this.
I changed from my forked of DataFusion repo to official DataFusion repo in Cargo.toml.
@andygrove I can changed to use official DataFusion repo, but cannot do same for arrow-rs repo. The official DataFusion repo uses arrow-rs release 51.0.0. If I point to arrow-rs repo, rust compiler will complain that two different versions of crate of arrow are used.
But we need one merged patch in arrow-rs which is not released yet.
Thus we still need to wait for next arrow-rs and DataFusion releases.
Switching back to latest DataFusion release/repo causes two issues right now:
- C Data interface issue: fixed in arrow-rs. Wait for arrow-rs 52.0.0 and new DataFusion release which uses it.
Aggregate functions (likefirst/last) issue: https://github.com/apache/datafusion-comet/issues/511
Okay. After #511 is fixed, there is only one failures:
Cause: org.apache.comet.CometNativeException: Fail to process Arrow array with reason C Data interface error: The external buffer at position 2 is null..
which is known and we already fixed it at the upstream. We only need to wait for next DataFusion and arrow-rs releases that should be happened soon.
Using 39.0.0-rc1 tag now.
All tests are passed!
cc @andygrove
Thanks @advancedxy @kazuyukitanimura @andygrove