delta-rs
delta-rs copied to clipboard
chore: upgrade to datafusion 43
Description
Bump kernel
- closes https://github.com/delta-io/delta-rs/issues/2850
We need Datafusion to upgrade as well before we can do this. I have some patches to make the latest datafusion work but we're blocked on them to adopt the newest arrow and kernel
Hi, datafusion has been released. Let's rock!
Thank you @ion-elgreco -- let me know if there is anything I can to do help with this PR (e.g. I could work on updating the code to use non deprecated APIs in the core for parquet statistics...)
Hey @alamb, I am not working on this atm, but @rtyler has picked it up from here
@alamb there seems to be a regression in DF, see https://github.com/delta-io/delta-rs/actions/runs/10959345721/job/30431504526?pr=2886#step:7:948.
A literal with a list is creating a dtype list with inner type using the name "item", we however have our data internally read according to the parquet spec, so that when we write the parquet files are meeting the field naming of the spec.
I had a quick glance at the Changelog but can't really pinpoint where this change may have occurred, do you have any ideas?
DFSchema { inner:
Schema { fields: [
Field { name: "items_in_bucket", data_type: List(Field { name: "element", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },
new schema:
DFSchema { inner: Schema { fields: [
Field { name: "items_in_bucket", data_type: List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },
@alamb there seems to be a regression in DF, see https://github.com/delta-io/delta-rs/actions/runs/10959345721/job/30431504526?pr=2886#step:7:948.
A literal with a list is creating a dtype list with inner type using the name "item", we however have our data internally read according to the parquet spec, so that when we write the parquet files are meeting the field naming of the spec.
It sounds somewhat similar to a discussion happening here about nullability (but I think field name is similar in terms of how it is treated): https://github.com/apache/datafusion/pull/11989#issuecomment-2360601240
Update; filed https://github.com/apache/datafusion/issues/12560 to track the regression more
@alamb thanks for following up! I feel a little sheepish :sheep: insofar that my internal CI showed these errors weeks ago in my integration testing between datafusion main with delta-rs main, I just didn't have the time to triage them like @ion-elgreco has :smile:
I've subscribed to the issue you created and can help testing if necessary! Depending on my availability this weekend I can try fixing it in datafusion too :thinking:
Hi there - I believe a fix was implemented in datafusion and this PR is not blocked any longer.
Hi there - I believe a fix was implemented in datafusion and this PR is not blocked any longer.
It's not yet released though
I can confirm that the latest datafusion main passes tests in CI :partying_face:
fix
I filed https://github.com/apache/datafusion/issues/12813 to consider releasing a patch version of DataFusion
Hi, is the title of this PR incorrect? We're going to DF 42, right?
Hi, is the title of this PR incorrect? We're going to DF 42, right?
The title is not incorrect :laughing: There will be no deltalake release with datafusion 42 because of the regression identified. Additionally it doesn't look like the datafusion community is going to create a 42.1.0, which means we have no choice but to skip 42 and wait for 43 :sob:
The title is not incorrect 😆 There will be no deltalake release with datafusion 42 because of the regression identified. Additionally it doesn't look like the datafusion community is going to create a 42.1.0, which means we have no choice but to skip 42 and wait for 43 😭
Understood. It's somewhat sad that our efforts to upgrade to Arrow 53 have been blocked by this issue.
Additionally it doesn't look like the datafusion community is going to create a 42.1.0, which means we have no choice but to skip 42 and wait for 43 😭
One alternate would be for someone to help us make a 42.1.0 (no one volunteered on https://github.com/apache/datafusion/issues/12813)
One alternate would be for someone to help us make a 42.1.0 (no one volunteered on apache/datafusion#12813)
Hi, I'm willing to help for this.
One alternate would be for someone to help us make a 42.1.0 (no one volunteered on apache/datafusion#12813)
Hi, I'm willing to help for this.
Update here is that we have found some help and will try and do this release sometime later this week or early next week
Updates: we only need one more binding vote to make this release.
Updates: we only need one more binding vote to make this release.
❤️ -- I think we have the votes, I am just now waiting for the 72 hour timeline to elapse and then I will publish the release
❤️ -- I think we have the votes, I am just now waiting for the 72 hour timeline to elapse and then I will publish the release
That's great. I've been waiting for this for so long. Happy that I can contribute to pushing this forward.
https://crates.io/crates/datafusion/42.1.0 is now available
There is still one failing test, which should be easy to fix. I don't have access to my PC though to resolve it, @Xuanwo would you perhaps be able to push to this branch?
There is still one failing test, which should be easy to fix. I don't have access to my PC though to resolve it, @Xuanwo would you perhaps be able to push to this branch?
I would really like to, but I'm not a committer for delta-rs. Perhaps we should seek assistance from @rtyler.
There is still one failing test, which should be easy to fix. I don't have access to my PC though to resolve it, @Xuanwo would you perhaps be able to push to this branch?
I would really like to, but I'm not a committer for delta-rs. Perhaps we should seek assistance from @rtyler.
You could potentially make a PR (in the https://github.com/ion-elgreco/delta-rs fork) that targets this branch that @ion-elgreco could merge in
However, isn't it (very) late in your timezone 💤 🤔
@ion-elgreco I fixed your test you just have to merge https://github.com/ion-elgreco/delta-rs/pull/4
Codecov Report
Attention: Patch coverage is 82.59494% with 55 lines in your changes missing coverage. Please review.
Project coverage is 72.46%. Comparing base (
0cb091c) to head (b144cf1). Report is 14 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #2886 +/- ##
==========================================
+ Coverage 72.39% 72.46% +0.07%
==========================================
Files 128 128
Lines 40530 40823 +293
Branches 40530 40823 +293
==========================================
+ Hits 29341 29583 +242
- Misses 9306 9356 +50
- Partials 1883 1884 +1
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
🚨 Try these New Features:
- Flaky Tests Detection - Detect and resolve failed and flaky tests
- JS Bundle Analysis - Avoid shipping oversized bundles
However, isn't it (very) late in your timezone 💤 🤔
Haha, indeed. This was the last message I sent yesterday.
@ion-elgreco I fixed your test you just have to merge https://github.com/ion-elgreco/delta-rs/pull/4
Thank you @hntd187 for making progress.
Unfortunately still one failure: https://github.com/delta-io/delta-rs/actions/runs/11429877833/job/31796752448#step:5:858
Hi, there appears to be a strange issue: items_in_bucket is a list. Its field used to be named element, but it has now been changed to item.
Any ideas to this? cc @ion-elgreco @alamb
Hi, there appears to be a strange issue:
items_in_bucketis a list. Its field used to be namedelement, but it has now been changed toitem.Any ideas to this? cc @ion-elgreco @alamb
We switched in kernel-rs to use parquet structure for these arrow types. But Datafusion seems to not play nicely anymore with two types being the same except for the field name