
Fix Python runtime error caused by numpy 2.0.0 release [databricks]

Open amahussein opened this pull request 1 year ago • 6 comments

Signed-off-by: Ahmed Hussein (amahussein) [email protected]

Fixes #11070

  • pin numpy to 1.24.4, the latest version that supports Python 3.8; this implies we drop support for Python 3.12
  • pin pandas to 1.4.3, which works for Python 3.8-3.10
  • pin pyarrow to 16.1.0, which works with both numpy 1.x and 2.x
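For illustration only, the combined pins amount to something like the pip command below; the actual install commands live in the project's CI and test setup scripts and may include additional packages.

```shell
# Illustrative only: pins matching the versions proposed above.
# The real install commands in the CI/test setup scripts may differ.
python -m pip install "numpy==1.24.4" "pandas==1.4.3" "pyarrow==16.1.0"
```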

amahussein · Jun 18 '24 16:06

build

amahussein · Jun 18 '24 21:06

After merging this PR, @pxLi please revert the workaround in https://github.com/NVIDIA/spark-rapids/pull/11072.

amahussein · Jun 18 '24 21:06

If the fix is our decision not to support pandas 2 and numpy 2, please also include fixes to the pip install parts in:

  • jenkins/databricks/setup.sh
  • jenkins/Dockerfile-blossom.integration.ubuntu
  • jenkins/Dockerfile-blossom.integration.rocky
  • jenkins/Dockerfile-blossom.ubuntu
  • integration_tests/README.md (update the Dependencies section)

and revert my workaround in the same PR so the pre-merge CI covers this case, thanks!
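As a hedged illustration (not an existing step in the repo's CI), a quick sanity check after installation could print the resolved versions so the pre-merge log shows whether the pins took effect:

```shell
# Hypothetical sanity check, not an existing CI step: print installed versions
# so the pre-merge CI log shows which numpy/pandas/pyarrow were resolved.
python -c "import numpy, pandas, pyarrow; print(numpy.__version__, pandas.__version__, pyarrow.__version__)"
```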

pxLi · Jun 19 '24 00:06

> If the fix is our decision not to support pandas 2 and numpy 2, please also include fixes to the pip install parts in jenkins/databricks/setup.sh, jenkins/Dockerfile-blossom.integration.ubuntu, jenkins/Dockerfile-blossom.integration.rocky, jenkins/Dockerfile-blossom.ubuntu, and integration_tests/README.md (update the Dependencies section), and revert my workaround in the same PR so the pre-merge CI covers this case, thanks!

Thanks @pxLi! NumPy 2.0 requires Python 3.9+, so I believe the question comes down to which Python versions the CI/CD supports. Using any package that depends on NumPy 2.0 implies dropping Python 3.8 from all of the CI/CD.
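As a rough illustration of that constraint (assuming both interpreters are available; the exact error text depends on the pip version):

```shell
# NumPy 2.0 declares support for Python 3.9+, so it cannot be resolved on 3.8.
# Illustrative commands; the exact pip error wording may vary.
python3.8 -m pip install "numpy>=2.0.0"   # fails: no matching distribution found
python3.9 -m pip install "numpy>=2.0.0"   # succeeds
```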

integration_tests/README.md does not specify the supported range of Python 3.x versions. For instance, many package releases that work on Python 3.12 do not work on Python 3.8.

Before making more changes to the suggested files, I would like more information about:

  • Do you still want to support Python 3.8?
  • What is the highest supported Python version? Is it Python 3.11? Consider that pandas/numpy have no releases that work on both Python 3.8 and Python 3.12.

amahussein · Jun 19 '24 02:06

@amahussein Thanks for taking care of this!

> Do you still want to support Python 3.8? What is the highest supported Python version? Is it Python 3.11? Consider that pandas/numpy have no releases that work on both Python 3.8 and Python 3.12.

These are more like questions for project owners, cc @sameerz @GaryShen2008 to help.

From a DevOps perspective, we currently have CI for Python 3.8, 3.9, and 3.10 to cover the existing Databricks runtimes (11.X, 12.X, 13.X). And per the PySpark installation page, Python 3.8 and above is supported: https://spark.apache.org/docs/latest/api/python/getting_started/install.html

So I assume we should support Python 3.8+; if not, we may need to add documentation to clarify the limitations.

pxLi · Jun 19 '24 03:06

We need to continue supporting Python 3.8+, as that is the minimum version supported by the current Spark release (3.5.1 as of today: https://spark.apache.org/docs/latest/). We should support all higher versions of Python, regardless of numpy.

For this case, has issue #11070 recurred?

sameerz · Jul 01 '24 17:07

Closing via https://github.com/NVIDIA/spark-rapids/pull/11138. Please let me know if we need any other fix, thanks!

pxLi · Jul 08 '24 01:07