kedro-plugins
[kedro-datasets] Upgrade to PySpark >= 3.4, Pandas >= 2 in `test_requirements.txt`
Description
Spark 3.4.0 was released in April. Our `databricks` and `spark` datasets should support this newer version of Spark, though it currently causes many tests to fail.
Also with this change, we should enforce Pandas >= 2, as earlier versions of Pandas are not compatible with Spark >= 3.4. This change will also enable us to upgrade `delta-spark`.
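As a rough sketch, the relevant pins in `test_requirements.txt` might end up looking something like this (the exact bounds are assumptions for illustration, not final values):

```text
pyspark>=3.4, <4.0
pandas>=2.0, <3.0
delta-spark>=2.4, <3.0
```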
Context
This is an important change as it will ensure our datasets work with the latest version of Spark.
I can't comment on Spark, but be careful when forcing something like pandas >= 2.0, as users typically rely on other packages that might not be compatible with pandas 2.0 yet. For example, Great Expectations (see the explicit comment in the requirements file on their GitHub here). Furthermore, I would also set upper limits so we don't get into trouble later on.
Off topic: upper version caps are tricky, and I'm not very happy with how Great Expectations handled this. Neither https://github.com/great-expectations/great_expectations/pull/7571 nor https://github.com/great-expectations/great_expectations/pull/7553 explains what the test failures with Altair were, their CI logs are gone, and I don't see any related issues upstream: https://github.com/altair-viz/altair/issues?q=is%3Aissue+pandas
I agree it might be premature to force pandas 2.0. Doesn't PySpark 3.4 onwards carry its own version pinning? I know that letting users install, say, PySpark 3.4 and pandas < 2 will get them weird errors, but if PySpark is not correctly pinning pandas, we should try to find another way if possible.
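For reference, whatever PySpark declares for pandas can be read straight from its package metadata; a minimal check using only the standard library (Python >= 3.8):

```python
# Print any pandas requirement that the installed pyspark
# distribution declares in its metadata, extras included.
from importlib.metadata import requires

for req in requires("pyspark") or []:
    if req.startswith("pandas"):
        print(req)
```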
I also have doubts about pinning pandas>=2.0; I don't see the ecosystem catching up that quickly, and this shouldn't be done for at least the coming 12 months.
The test suite is a separate problem, and how we should test our datasets is an additional question. In any case, I would say we should tackle this in our test suite but not force it on our users.
- https://github.com/kedro-org/kedro/issues/1498
For example, if a user is using `pyspark==3.2.0` and `pandas==1.5.3`, they shouldn't be blocked by `kedro-datasets[spark]`.
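In other words, the user-facing extras could stay permissive while the test pins stay strict; a hypothetical excerpt of what that could look like in `setup.py` (the actual bounds in kedro-datasets may differ):

```python
# Hypothetical, permissive user-facing extra: a user on
# pyspark==3.2.0 and pandas==1.5.3 would still satisfy it,
# while test_requirements.txt pins pyspark>=3.4 for CI only.
extras_require = {
    "spark": ["pyspark>=2.2, <4.0"],
}
```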
Okay, I just read the title carefully; this is only about `test_requirements.txt`. I misunderstood it as being about the installation.
Do we have some idea of what's failing when we have `pandas>2.0`? Potentially we will touch/fix it when we try to add Python 3.11 support.
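One concrete breakage worth checking first (an assumption about what the failing tests hit, not confirmed from CI logs): pandas 2.0 removed `DataFrame.iteritems()`, which PySpark older than 3.4 calls internally when converting pandas DataFrames:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
pdf = pd.DataFrame({"a": [1, 2, 3]})

# With pyspark < 3.4 and pandas >= 2.0 this raises:
#   AttributeError: 'DataFrame' object has no attribute 'iteritems'
# PySpark 3.4 switched to DataFrame.items(), so it works there.
sdf = spark.createDataFrame(pdf)
print(sdf.collect())
```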
Whoops, I also misread the title, thanks @noklam :+1:
Haha I also misread the title 😅. Thanks @noklam for pointing it out!
This is some kind of collective hallucination 😂
Maybe a good remark to add here: no current version of Spark is compatible with Pandas >= 2! If you look at the Jira issue tracker of Spark, compatibility with Pandas 2.0 is foreseen for the next major version upgrade of Spark (Spark 4.0).
> If you look at the Jira issue tracker of Spark, compatibility with Pandas 2.0 is foreseen for the next major version upgrade of Spark (Spark 4.0)
Do you have a link? I tried a quick search but Jira and I cannot be friends
@astrojuanlu: Sure, here is the link (note the "Affects Version/s" field).