delta-rs icon indicating copy to clipboard operation
delta-rs copied to clipboard

chore: upgrade to datafusion 43

Open ion-elgreco opened this issue 1 year ago • 12 comments

Description

Bump kernel

  • closes https://github.com/delta-io/delta-rs/issues/2850

ion-elgreco avatar Sep 14 '24 09:09 ion-elgreco

We need Datafusion to upgrade as well before we can do this. I have some patches to make the latest datafusion work but we're blocked on them to adopt the newest arrow and kernel

rtyler avatar Sep 16 '24 16:09 rtyler

Hi, datafusion has been released. Let's rock!

Xuanwo avatar Sep 18 '24 13:09 Xuanwo

Thank you @ion-elgreco -- let me know if there is anything I can to do help with this PR (e.g. I could work on updating the code to use non deprecated APIs in the core for parquet statistics...)

alamb avatar Sep 20 '24 10:09 alamb

Hey @alamb, I am not working on this atm, but @rtyler has picked it up from here

ion-elgreco avatar Sep 20 '24 11:09 ion-elgreco

@alamb there seems to be a regression in DF, see https://github.com/delta-io/delta-rs/actions/runs/10959345721/job/30431504526?pr=2886#step:7:948.

A literal with a list is creating a dtype list with inner type using the name "item", we however have our data internally read according to the parquet spec, so that when we write the parquet files are meeting the field naming of the spec.

I had a quick glance at the Changelog but can't really pinpoint where this change may have occurred, do you have any ideas?

DFSchema { inner: 
Schema { fields: [
Field { name: "items_in_bucket", data_type: List(Field { name: "element", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }, 
new schema: 
DFSchema { inner: Schema { fields: [
Field { name: "items_in_bucket", data_type: List(Field { name: "item", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} },

ion-elgreco avatar Sep 20 '24 14:09 ion-elgreco

@alamb there seems to be a regression in DF, see https://github.com/delta-io/delta-rs/actions/runs/10959345721/job/30431504526?pr=2886#step:7:948.

A literal with a list is creating a dtype list with inner type using the name "item", we however have our data internally read according to the parquet spec, so that when we write the parquet files are meeting the field naming of the spec.

It sounds somewhat similar to a discussion happening here about nullability (but I think field name is similar in terms of how it is treated): https://github.com/apache/datafusion/pull/11989#issuecomment-2360601240

Update; filed https://github.com/apache/datafusion/issues/12560 to track the regression more

alamb avatar Sep 20 '24 16:09 alamb

@alamb thanks for following up! I feel a little sheepish :sheep: insofar that my internal CI showed these errors weeks ago in my integration testing between datafusion main with delta-rs main, I just didn't have the time to triage them like @ion-elgreco has :smile:

I've subscribed to the issue you created and can help testing if necessary! Depending on my availability this weekend I can try fixing it in datafusion too :thinking:

rtyler avatar Sep 20 '24 16:09 rtyler

Hi there - I believe a fix was implemented in datafusion and this PR is not blocked any longer.

matthewmturner avatar Oct 07 '24 13:10 matthewmturner

Hi there - I believe a fix was implemented in datafusion and this PR is not blocked any longer.

It's not yet released though

ion-elgreco avatar Oct 07 '24 14:10 ion-elgreco

I can confirm that the latest datafusion main passes tests in CI :partying_face:

rtyler avatar Oct 07 '24 20:10 rtyler

fix

I filed https://github.com/apache/datafusion/issues/12813 to consider releasing a patch version of DataFusion

alamb avatar Oct 08 '24 15:10 alamb

Hi, is the title of this PR incorrect? We're going to DF 42, right?

Xuanwo avatar Oct 11 '24 09:10 Xuanwo

Hi, is the title of this PR incorrect? We're going to DF 42, right?

The title is not incorrect :laughing: There will be no deltalake release with datafusion 42 because of the regression identified. Additionally it doesn't look like the datafusion community is going to create a 42.1.0, which means we have no choice but to skip 42 and wait for 43 :sob:

rtyler avatar Oct 16 '24 10:10 rtyler

The title is not incorrect 😆 There will be no deltalake release with datafusion 42 because of the regression identified. Additionally it doesn't look like the datafusion community is going to create a 42.1.0, which means we have no choice but to skip 42 and wait for 43 😭

Understood. It's somewhat sad that our efforts to upgrade to Arrow 53 have been blocked by this issue.

Xuanwo avatar Oct 16 '24 10:10 Xuanwo

Additionally it doesn't look like the datafusion community is going to create a 42.1.0, which means we have no choice but to skip 42 and wait for 43 😭

One alternate would be for someone to help us make a 42.1.0 (no one volunteered on https://github.com/apache/datafusion/issues/12813)

alamb avatar Oct 16 '24 14:10 alamb

One alternate would be for someone to help us make a 42.1.0 (no one volunteered on apache/datafusion#12813)

Hi, I'm willing to help for this.

Xuanwo avatar Oct 16 '24 14:10 Xuanwo

One alternate would be for someone to help us make a 42.1.0 (no one volunteered on apache/datafusion#12813)

Hi, I'm willing to help for this.

Update here is that we have found some help and will try and do this release sometime later this week or early next week

alamb avatar Oct 16 '24 16:10 alamb

Updates: we only need one more binding vote to make this release.

Xuanwo avatar Oct 20 '24 05:10 Xuanwo

Updates: we only need one more binding vote to make this release.

❤️ -- I think we have the votes, I am just now waiting for the 72 hour timeline to elapse and then I will publish the release

alamb avatar Oct 20 '24 10:10 alamb

❤️ -- I think we have the votes, I am just now waiting for the 72 hour timeline to elapse and then I will publish the release

That's great. I've been waiting for this for so long. Happy that I can contribute to pushing this forward.

Xuanwo avatar Oct 20 '24 11:10 Xuanwo

https://crates.io/crates/datafusion/42.1.0 is now available

alamb avatar Oct 20 '24 18:10 alamb

There is still one failing test, which should be easy to fix. I don't have access to my PC though to resolve it, @Xuanwo would you perhaps be able to push to this branch?

ion-elgreco avatar Oct 20 '24 18:10 ion-elgreco

There is still one failing test, which should be easy to fix. I don't have access to my PC though to resolve it, @Xuanwo would you perhaps be able to push to this branch?

I would really like to, but I'm not a committer for delta-rs. Perhaps we should seek assistance from @rtyler.

Xuanwo avatar Oct 20 '24 18:10 Xuanwo

There is still one failing test, which should be easy to fix. I don't have access to my PC though to resolve it, @Xuanwo would you perhaps be able to push to this branch?

I would really like to, but I'm not a committer for delta-rs. Perhaps we should seek assistance from @rtyler.

You could potentially make a PR (in the https://github.com/ion-elgreco/delta-rs fork) that targets this branch that @ion-elgreco could merge in

However, isn't it (very) late in your timezone 💤 🤔

alamb avatar Oct 20 '24 20:10 alamb

@ion-elgreco I fixed your test you just have to merge https://github.com/ion-elgreco/delta-rs/pull/4

hntd187 avatar Oct 20 '24 21:10 hntd187

Codecov Report

Attention: Patch coverage is 82.59494% with 55 lines in your changes missing coverage. Please review.

Project coverage is 72.46%. Comparing base (0cb091c) to head (b144cf1). Report is 14 commits behind head on main.

Files with missing lines Patch % Lines
crates/core/src/delta_datafusion/expr.rs 70.87% 26 Missing and 4 partials :warning:
crates/sql/src/logical_plan.rs 20.00% 12 Missing :warning:
crates/core/src/writer/stats.rs 65.00% 0 Missing and 7 partials :warning:
crates/core/src/kernel/scalars.rs 0.00% 2 Missing :warning:
crates/core/src/operations/update.rs 98.56% 0 Missing and 2 partials :warning:
crates/core/src/delta_datafusion/schema_adapter.rs 88.88% 0 Missing and 1 partial :warning:
python/src/lib.rs 0.00% 1 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2886      +/-   ##
==========================================
+ Coverage   72.39%   72.46%   +0.07%     
==========================================
  Files         128      128              
  Lines       40530    40823     +293     
  Branches    40530    40823     +293     
==========================================
+ Hits        29341    29583     +242     
- Misses       9306     9356      +50     
- Partials     1883     1884       +1     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.


🚨 Try these New Features:

codecov[bot] avatar Oct 20 '24 21:10 codecov[bot]

However, isn't it (very) late in your timezone 💤 🤔

Haha, indeed. This was the last message I sent yesterday.

@ion-elgreco I fixed your test you just have to merge https://github.com/ion-elgreco/delta-rs/pull/4

Thank you @hntd187 for making progress.

Xuanwo avatar Oct 21 '24 02:10 Xuanwo

Unfortunately still one failure: https://github.com/delta-io/delta-rs/actions/runs/11429877833/job/31796752448#step:5:858

ion-elgreco avatar Oct 21 '24 05:10 ion-elgreco

Hi, there appears to be a strange issue: items_in_bucket is a list. Its field used to be named element, but it has now been changed to item.

Any ideas to this? cc @ion-elgreco @alamb

Xuanwo avatar Oct 21 '24 10:10 Xuanwo

Hi, there appears to be a strange issue: items_in_bucket is a list. Its field used to be named element, but it has now been changed to item.

Any ideas to this? cc @ion-elgreco @alamb

We switched in kernel-rs to use parquet structure for these arrow types. But Datafusion seems to not play nicely anymore with two types being the same except for the field name

ion-elgreco avatar Oct 21 '24 10:10 ion-elgreco