
[kedro-datasets] Upgrade to PySpark >= 3.4, Pandas >= 2 in `test_requirements.txt`

Open jmholzer opened this issue 1 year ago • 11 comments

Description

Spark 3.4.0 was released in April. Our Databricks and Spark datasets should support this newer version of Spark, though it currently causes many tests to fail.

With this change, we should also enforce Pandas >= 2, as earlier versions of Pandas are not compatible with Spark >= 3.4. This change will also enable us to upgrade delta-spark.
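For concreteness, the pins in `test_requirements.txt` might look something like the excerpt below. The exact bounds are illustrative only, not a final proposal; in particular, the delta-spark bound is an assumption about which release line is compatible with Spark 3.4:

```text
# test_requirements.txt (illustrative excerpt, bounds are assumptions)
pyspark>=3.4,<4.0
pandas>=2.0,<3.0
delta-spark>=2.4  # assumed: the line compatible with Spark 3.4
```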

Context

This is an important change as it will ensure our datasets work with the latest version of Spark.

jmholzer avatar May 19 '23 16:05 jmholzer

I can't comment on Spark, but be careful when forcing something like pandas >= 2.0, as users typically use other packages that might not be compatible with pandas 2.0 yet. For example, Great Expectations (see the explicit comment in the requirements file on their GitHub here). Furthermore, I would also set upper limits so as not to get into trouble later on.

MatthiasRoels avatar Jun 12 '23 12:06 MatthiasRoels

Off topic: Upper version caps are tricky, and I'm not very happy with how Great Expectations handled this. Neither https://github.com/great-expectations/great_expectations/pull/7571 nor https://github.com/great-expectations/great_expectations/pull/7553 explains what the test failures with Altair were, their CI logs are gone, and I don't see any related issues upstream: https://github.com/altair-viz/altair/issues?q=is%3Aissue+pandas

astrojuanlu avatar Jun 12 '23 13:06 astrojuanlu

I agree it might be premature to force pandas 2.0. Doesn't PySpark 3.4 onwards carry its own version pinning? I know letting users install, say, PySpark 3.4 and pandas < 2 will lead to weird errors, but if PySpark is not correctly pinning pandas, we should try to find another way if possible.

astrojuanlu avatar Jun 12 '23 13:06 astrojuanlu
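One way to answer "does PySpark carry its own pandas pin?" is to inspect the metadata that an installed distribution declares. A minimal stdlib-only sketch, assuming the package is installed locally (the name `pyspark` is just the example from this thread; the helper `declared_pandas_pins` is hypothetical, not part of any library):

```python
# Sketch: list the requirement strings mentioning pandas that an
# installed distribution declares, using only the standard library.
from importlib import metadata


def declared_pandas_pins(dist_name: str) -> list[str]:
    """Return the declared requirement strings that mention pandas.

    Returns an empty list if the distribution is not installed.
    """
    try:
        requires = metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return []
    return [req for req in requires if req.lower().startswith("pandas")]


if __name__ == "__main__":
    # If this prints only extras-guarded requirements (or nothing),
    # a plain `pip install pyspark` would not constrain pandas at all,
    # which would explain the "weird errors" scenario above.
    print(declared_pandas_pins("pyspark"))
```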

I also have doubts about pinning pandas >= 2.0. I don't see the ecosystem catching up that quickly, and this shouldn't be done for at least the coming 12 months.

The test suite is a separate problem. How we should test our datasets is an additional question. In any case, I would say we should tackle this in our test suite but not force it on our users.

  • https://github.com/kedro-org/kedro/issues/1498

For example, if a user is using pyspark==3.2.0 and pandas==1.5.3, they shouldn't be blocked by kedro-datasets[spark].

noklam avatar Jun 12 '23 14:06 noklam
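The distinction noklam draws, strict pins in `test_requirements.txt` but loose floors at install time, can be illustrated with a toy version check. This is a deliberately simplified stdlib-only sketch (real resolvers use the `packaging` library and PEP 440 rules; the `1.3` floor is a made-up example, not kedro-datasets' actual requirement):

```python
# Sketch: a floor-only runtime pin admits old pandas, while the
# stricter test-only pin would not. Naive dotted-integer comparison;
# real tools parse versions per PEP 440.

def version_tuple(v: str) -> tuple[int, ...]:
    """Split '1.5.3' into (1, 5, 3) for lexicographic comparison."""
    return tuple(int(part) for part in v.split("."))


def satisfies_floor(installed: str, floor: str) -> bool:
    """True if the installed version meets a '>= floor' requirement."""
    return version_tuple(installed) >= version_tuple(floor)


# A user on pandas 1.5.3 passes a hypothetical loose runtime floor...
print(satisfies_floor("1.5.3", "1.3"))  # True
# ...but would fail the proposed test-only floor of >= 2.0.
print(satisfies_floor("1.5.3", "2.0"))  # False
```

So tightening `test_requirements.txt` constrains CI only; install-time metadata can stay permissive and leave users on pyspark==3.2.0 + pandas==1.5.3 unaffected.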

Okay, I just read the title carefully: this is only about test_requirements.txt. I misunderstood it as being about the installation requirements.

Do we have any idea what's failing when we have pandas>2.0? Potentially we will touch/fix it when we try to add Python 3.11 support.

noklam avatar Jun 12 '23 14:06 noklam

Whoops, I also misread the title, thanks @noklam :+1:

astrojuanlu avatar Jun 12 '23 15:06 astrojuanlu

Haha I also misread the title 😅. Thanks @noklam for pointing it out!

MatthiasRoels avatar Jun 12 '23 19:06 MatthiasRoels

This is some kind of collective hallucination 😂

noklam avatar Jun 12 '23 20:06 noklam

Maybe a good remark to add here: all current versions of Spark are incompatible with Pandas >= 2! If you look at Spark's Jira issue tracker, compatibility with Pandas 2.0 is planned for the next major version of Spark (Spark 4.0).

MatthiasRoels avatar Sep 27 '23 08:09 MatthiasRoels

If you look at the Jira issue tracker of Spark, compatibility with Pandas 2.0 is foreseen for the next major version upgrade of Spark (Spark 4.0)

Do you have a link? I tried a quick search but Jira and I cannot be friends

astrojuanlu avatar Sep 27 '23 09:09 astrojuanlu

@astrojuanlu: Sure, here is the link (note the Affects Version field).

MatthiasRoels avatar Sep 27 '23 11:09 MatthiasRoels