spark-rapids
Fix Python runtime error caused by numpy 2.0.0 release [databricks]
Signed-off-by: Ahmed Hussein (amahussein) [email protected]
Fixes #11070
- pin numpy to 1.24.4, which is the latest version that supports Python 3.8. This implies that we drop support for Python 3.12
- pin pandas to 1.4.3, which works for Python 3.8-3.10
- pin pyarrow to 16.1.0, which works with both numpy 1.x and 2.0 (see the sketch below)
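For reference, a minimal sketch of what these pins could look like as a single pip install step; the versions come from the description above, but the exact command and where it lives in the repo are assumptions, not a quote of this PR's diff:

```shell
# Pin to the last releases that still support Python 3.8 and stay on numpy 1.x
# (versions from the PR description; the install location is hypothetical).
pip install "numpy==1.24.4" "pandas==1.4.3" "pyarrow==16.1.0"
```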
After merging this PR, @pxLi please revert the workaround from https://github.com/NVIDIA/spark-rapids/pull/11072
If the fix reflects our decision not to support pandas 2 and numpy 2,
please also help fix the pip install parts (see the sketch after this list) in:
jenkins/databricks/setup.sh
jenkins/Dockerfile-blossom.integration.ubuntu
jenkins/Dockerfile-blossom.integration.rocky
jenkins/Dockerfile-blossom.ubuntu
integration_tests/README.md (update the Dependencies section)
and revert my workaround in the same PR, so the pre-merge CI would cover the case, thanks!
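A hedged sketch of the kind of pip change being asked for in those files; the real scripts and Dockerfiles may structure their installs differently, and the upper bounds here are assumptions based on the "no pandas 2 / numpy 2" decision rather than the actual follow-up diff:

```shell
# Hypothetical install line for the jenkins setup scripts / Dockerfiles listed above.
# Upper bounds keep the 1.x lines without hard-coding a single patch release;
# pyarrow 16.1.0 matches the pin from the PR description.
pip install "numpy<2.0.0" "pandas<2.0.0" "pyarrow==16.1.0"
```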
Thanks @pxLi! Numpy 2.0 requires Python 3.9+. I believe the question comes down to which Python versions the CI/CD supports. Using any package that depends on numpy 2.0 implies dropping Python 3.8 from all the CI/CD.
integration_tests/README.md does not specify the supported range of Python 3.x. For instance, many package releases that work on Python 3.12 do not work on Python 3.8.
In order to make more changes to the suggested files, I would like to have more information about:
- Do you still want to support Python 3.8?
- What is the highest supported Python version? Is it Python 3.11? Consider that pandas/numpy do not have any releases that work on both Python 3.8 and Python 3.12 (see the sketch below).
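To illustrate that conflict, a hypothetical install step that branches on the interpreter version; the bounds in the newer branch are assumptions (numpy 1.26+ and pandas 2.1+ are the first lines that advertise Python 3.12 support), not something decided in this thread:

```shell
# Hypothetical: no single numpy/pandas pin covers both Python 3.8 and 3.12,
# so any pin would have to branch on the interpreter version.
if python -c 'import sys; sys.exit(sys.version_info >= (3, 9))'; then
    # Python 3.8: last releases that still support it
    pip install "numpy==1.24.4" "pandas==1.4.3"
else
    # Python 3.9+ (including 3.12): newer releases that have dropped 3.8 (bounds assumed)
    pip install "numpy>=1.26" "pandas>=2.1"
fi
```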
@amahussein Thanks for taking care of this!
> Do you still want to support Python 3.8? What is the highest supported Python version? Is it Python 3.11? Consider that pandas/numpy do not have any releases that work on both Python 3.8 and Python 3.12.
These are more like questions for project owners, cc @sameerz @GaryShen2008 to help.
From a DevOps perspective, we currently have CI for Python 3.8, 3.9, and 3.10 to cover the existing Databricks runtimes (11.X, 12.X, 13.X). And per the PySpark installation page, Python 3.8 and above is supported: https://spark.apache.org/docs/latest/api/python/getting_started/install.html
So I assume we should support Python 3.8+; if not, then we may need to add some documentation to clarify the limitations.
We need to continue supporting Python 3.8+, as that is the version supported by the current release of Spark (3.5.1 as of today - https://spark.apache.org/docs/latest/ ). We should support all higher versions of Python, regardless of numpy.
For this case, has issue #11070 recurred?
Closed with https://github.com/NVIDIA/spark-rapids/pull/11138. Please let me know if we need some other fix, thanks.