
Fix Python runtime error caused by numpy 2.0.0 release [databricks]

Open amahussein opened this pull request 1 year ago • 6 comments

Signed-off-by: Ahmed Hussein (amahussein) [email protected]

Fixes #11070

  • pin numpy to 1.24.4, the latest version that supports Python 3.8; this implies we drop support for Python 3.12
  • pin pandas to 1.4.3, which works for Python 3.8-3.10
  • pin pyarrow to 16.1.0, which works with both numpy 1.x and 2.x
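For illustration only, the combined pins amount to something like the pip command below; the actual install commands live in the project's CI and test setup scripts and may include additional packages.

```shell
# Illustrative only: pins matching the versions proposed above.
# The real install commands in the CI/test setup scripts may differ.
python -m pip install "numpy==1.24.4" "pandas==1.4.3" "pyarrow==16.1.0"
```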

amahussein · Jun 18 '24 16:06

build

amahussein · Jun 18 '24 21:06

After merging this PR, @pxLi please revert the workaround in https://github.com/NVIDIA/spark-rapids/pull/11072.

amahussein · Jun 18 '24 21:06

If the fix is our decision not to support pandas 2 and numpy 2, please also include fixes to the pip install parts in:

  • jenkins/databricks/setup.sh
  • jenkins/Dockerfile-blossom.integration.ubuntu
  • jenkins/Dockerfile-blossom.integration.rocky
  • jenkins/Dockerfile-blossom.ubuntu
  • integration_tests/README.md (update the Dependencies section)

and revert my workaround in the same PR so the pre-merge CI covers this case, thanks!
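As a hedged illustration (not an existing step in the repo's CI), a quick sanity check after installation could print the resolved versions so the pre-merge log shows whether the pins took effect:

```shell
# Hypothetical sanity check, not an existing CI step: print installed versions
# so the pre-merge CI log shows which numpy/pandas/pyarrow were resolved.
python -c "import numpy, pandas, pyarrow; print(numpy.__version__, pandas.__version__, pyarrow.__version__)"
```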

pxLi · Jun 19 '24 00:06

> If the fix is our decision not to support pandas 2 and numpy 2, please also include fixes to the pip install parts in jenkins/databricks/setup.sh, jenkins/Dockerfile-blossom.integration.ubuntu, jenkins/Dockerfile-blossom.integration.rocky, jenkins/Dockerfile-blossom.ubuntu, and integration_tests/README.md (update the Dependencies section), and revert my workaround in the same PR so the pre-merge CI covers this case, thanks!

Thanks @pxLi! NumPy 2.0 requires Python 3.9+, so I believe the question comes down to which Python versions the CI/CD supports. Using any package that depends on NumPy 2.0 implies dropping Python 3.8 from all of the CI/CD.
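As a rough illustration of that constraint (assuming both interpreters are available; the exact error text depends on the pip version):

```shell
# NumPy 2.0 declares support for Python 3.9+, so it cannot be resolved on 3.8.
# Illustrative commands; the exact pip error wording may vary.
python3.8 -m pip install "numpy>=2.0.0"   # fails: no matching distribution found
python3.9 -m pip install "numpy>=2.0.0"   # succeeds
```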

integration_tests/README.md does not specify the supported range of Python 3.x versions. For instance, many package releases that work on Python 3.12 do not work on Python 3.8.

Before making more changes to the suggested files, I would like more information about:

  • Do you still want to support Python 3.8?
  • What is the highest supported Python version? Is it Python 3.11? Consider that pandas/numpy have no releases that work on both Python 3.8 and Python 3.12.

amahussein · Jun 19 '24 02:06

@amahussein Thanks for taking care of this!

> Do you still want to support Python 3.8? What is the highest supported Python version? Is it Python 3.11? Consider that pandas/numpy have no releases that work on both Python 3.8 and Python 3.12.

These are more like questions for project owners, cc @sameerz @GaryShen2008 to help.

From a DevOps perspective, we currently have CI for Python 3.8, 3.9, and 3.10 to cover the existing Databricks runtimes (11.X, 12.X, 13.X). And per the PySpark installation page, Python 3.8 and above is supported: https://spark.apache.org/docs/latest/api/python/getting_started/install.html

So I assume we should support Python 3.8+; if not, we may need to add documentation to clarify the limitations.

pxLi · Jun 19 '24 03:06

We need to continue supporting Python 3.8+, as that is the minimum version supported by the current Spark release (3.5.1 as of today: https://spark.apache.org/docs/latest/). We should support all higher versions of Python, regardless of numpy.

For this case, has issue #11070 recurred?

sameerz · Jul 01 '24 17:07

Closing via https://github.com/NVIDIA/spark-rapids/pull/11138. Please let me know if we need any other fix, thanks!

pxLi · Jul 08 '24 01:07