datafusion-comet icon indicating copy to clipboard operation
datafusion-comet copied to clipboard

build: Switch back to released version of DataFusion and arrow-rs after Arrow Java 16 is released

Open viirya opened this issue 1 year ago • 10 comments

Which issue does this PR close?

Closes #248.

Rationale for this change

What changes are included in this PR?

How are these changes tested?

viirya avatar May 08 '24 17:05 viirya

Note that this is blocked by #250. We need to verify Java Arrow after #250 is merged.

viirya avatar May 08 '24 17:05 viirya

The fix (https://github.com/apache/datafusion/pull/10094) is not released yet. So the following error still happens in CI:

CometAggregateSuite:
- count with aggregation filter (371 milliseconds)
- lead/lag should return the default value if the offset row does not exist *** FAILED *** (133 milliseconds)
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 32.0 failed 1 times, most recent failure: Lost task 0.0 in stage 32.0 (TID 45) (74d216948a65 executor driver): org.apache.comet.CometNativeException: Arrow error: Invalid argument error: must either specify a row count or at least one column

Need to wait for next release.

viirya avatar May 09 '24 02:05 viirya

Besides, the Java Arrow bug is supported to be fixed in Arrow Java 16. But I still see it happens in a few pipelines:

- multiple column distinct count *** FAILED *** (392 milliseconds)
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 59.0 failed 1 times, most recent failure: Lost task 5.0 in stage 59.0 (TID 117) (74d216948a65 executor driver): org.apache.comet.CometNativeException: Fail to process Arrow array with reason C Data interface error: The external buffer at position 2 is null..

And more interesting is, the test doesn't always have the failure. I also cannot reproduce it locally. Wondering if there are more than one Java Arrow jar is included to randomly it picks up an older version. 🤔

BTW, it also only failed on ubuntu pipelines, mac os-14 pipelines are all okay without the C Data interface error.

viirya avatar May 09 '24 02:05 viirya

Hmm, the error is actually different:

  Cause: org.apache.comet.CometNativeException: Fail to process Arrow array with reason C Data interface error: The external buffer at position 2 is null..

Previously, the fixed one is the buffer at position 1 (offset buffer):

General execution error with reason org.apache.comet.CometNativeException: Fail to process Arrow array with reason C Data interface error: The external buffer at position 1 is null...

I need to look at what the buffer is at position 2 and why Java Arrow passes a null buffer again...

viirya avatar May 09 '24 19:05 viirya

Ah, I found that I made a mistake in the Java Arrow PR that it doesn't initiate the offset buffer well. Proposed another issue at Java Arrow https://github.com/apache/arrow/issues/41609

I found it by trying to dump the buffer value:

- multiple column distinct count *** FAILED *** (391 milliseconds)
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 71.0 failed 1 times, most recent failure: Lost task 5.0 in stage 71.0 (TID 125) (ed8fe556741f executor driver): org.apache.comet.CometNativeException: Fail to process Arrow array with reason C Data interface error: The external buffer with len = 18446744071618653625 at position 2 is null..

viirya avatar May 09 '24 21:05 viirya

I fixed the issue at arrow-rs in the PR https://github.com/apache/arrow-rs/issues/5756.

We can continue this PR after the fix is released. Keep this as draft for now.

viirya avatar May 13 '24 05:05 viirya

I have verified the fix https://github.com/apache/arrow-rs/issues/5756 can work by cherry-picking it in my forked branch of arrow-rs.

We only need to wait for new release of arrow-rs and DataFusion.

viirya avatar May 14 '24 00:05 viirya

We only need to wait for new release of arrow-rs and DataFusion.

Do we have an estimate on when new releases of arrow-rs and DataFusion will be available? I'm asking because I'm wondering whether to include this in the first release of Comet. It seems correct to use the released version of arrow-rs and DataFusion in the official release of Comet.

advancedxy avatar May 16 '24 14:05 advancedxy

arrow-rs and DataFusion have fast release cycle. We can hold Comet release after the new releases of arrow-rs and DataFusion.

viirya avatar May 16 '24 14:05 viirya

We can hold Comet release after the new releases of arrow-rs and DataFusion.

Thanks for the clarification. I second this.

advancedxy avatar May 16 '24 14:05 advancedxy

I changed from my forked of DataFusion repo to official DataFusion repo in Cargo.toml.

viirya avatar May 30 '24 21:05 viirya

@andygrove I can changed to use official DataFusion repo, but cannot do same for arrow-rs repo. The official DataFusion repo uses arrow-rs release 51.0.0. If I point to arrow-rs repo, rust compiler will complain that two different versions of crate of arrow are used.

But we need one merged patch in arrow-rs which is not released yet.

Thus we still need to wait for next arrow-rs and DataFusion releases.

viirya avatar May 31 '24 00:05 viirya

Switching back to latest DataFusion release/repo causes two issues right now:

  1. C Data interface issue: fixed in arrow-rs. Wait for arrow-rs 52.0.0 and new DataFusion release which uses it.
  2. Aggregate functions (like first/last) issue: https://github.com/apache/datafusion-comet/issues/511

viirya avatar Jun 04 '24 00:06 viirya

Okay. After #511 is fixed, there is only one failures:

  Cause: org.apache.comet.CometNativeException: Fail to process Arrow array with reason C Data interface error: The external buffer at position 2 is null..

which is known and we already fixed it at the upstream. We only need to wait for next DataFusion and arrow-rs releases that should be happened soon.

viirya avatar Jun 06 '24 13:06 viirya

Using 39.0.0-rc1 tag now.

viirya avatar Jun 07 '24 17:06 viirya

All tests are passed!

viirya avatar Jun 07 '24 18:06 viirya

cc @andygrove

viirya avatar Jun 07 '24 18:06 viirya

Thanks @advancedxy @kazuyukitanimura @andygrove

viirya avatar Jun 07 '24 19:06 viirya