[SPARK-48710][PYTHON] Use NumPy 2.0 compatible types
What changes were proposed in this pull request?
- Replace NumPy types removed in NumPy 2.0 with their equivalent counterparts
- Make tests compatible with the new `__repr__` of numerical scalars
Why are the changes needed?
PySpark references some symbols that were removed in NumPy 2.0:
- `np.NaN` was removed; it should be replaced with `np.nan`
- `np.string_` was removed; it was an alias for `np.bytes_`
- `np.float_` was removed; it was defined the same as `np.double`
- `np.unicode_` was removed; it was an alias for `np.str_`
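For illustration, a minimal sketch of the replacements listed above (not taken from the PR diff); all of the right-hand names exist in both NumPy 1.x and 2.x, so no version check is needed for them:

```python
import numpy as np

# NumPy 1/2-compatible replacements for the removed aliases:
missing = np.nan        # instead of the removed np.NaN
byte_type = np.bytes_   # instead of the removed np.string_
float_type = np.double  # instead of the removed np.float_
text_type = np.str_     # instead of the removed np.unicode_
```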
NumPy 2.0 changed the `__repr__` of numerical scalars to contain type information (e.g. `np.int32(3)` instead of `3`). The old behavior can be enabled via NumPy's legacy printing mode, e.g. `np.printoptions(legacy="1.25")` (or the older `"1.21"` and `"1.13"` legacy modes). There are multiple tests and doctests that rely on the old behavior.
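To make the behavior change concrete, a small illustration (assumes NumPy >= 2.0 is installed; the `legacy="1.25"` print mode does not exist in older versions):

```python
import numpy as np

x = np.int32(3)
print(repr(x))  # NumPy 2.0: "np.int32(3)"; NumPy 1.x printed just "3"

# The legacy printing mode restores the 1.x-style scalar repr that the
# affected doctests were written against:
with np.printoptions(legacy="1.25"):
    print(repr(x))  # "3"
```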
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Tests for modules pyspark-connect, pyspark-core, pyspark-errors, pyspark-mllib, pyspark-pandas, pyspark-sql, pyspark-resource, pyspark-testing were executed in a local venv with numpy==2.0.0 installed.
Was this patch authored or co-authored using generative AI tooling?
No.
cc @itholic
Let's use the default PR template:
### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
### Was this patch authored or co-authored using generative AI tooling?
@itholic thanks for reviewing!
I added some small changes to the tests to make them compatible with NumPy 2.0. The PR description is updated accordingly and formatted according to the default template. I couldn't get the pyspark-ml tests running locally yet, so those are not yet tested with NumPy 2.0.
Maybe it makes sense to update the GitHub jobs to test with both the lowest supported (i.e. 1.21) and latest (i.e. 2.0.0) NumPy version?
Oh, okay, it seems NumPy recently released a new major version (2024-06-17): Release Note.
@HyukjinKwon Maybe should we upgrade the minimum NumPy support to 2.0.0 as we did for Pandas??
Also cc @zhengruifeng, who has worked on a similar PR: https://github.com/apache/spark/pull/42944.
I think that's too aggressive. NumPy is also used in Spark ML and many other dependent projects.
cc @WeichenXu123
Seems fine from a cursory look but let's make the CI happy :-).
CI passed!
There are some linter failures: https://github.com/codesorcery/spark/actions/runs/9708335267/job/26795030107
Added # type: ignore[arg-type] to the affected lines, since legacy="1.25" is only implemented in numpy>=2 and we're checking that the code path is only executed when numpy>=2 is installed.
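A rough sketch of what such a guarded call looks like (the exact version check used in the PR may differ; older NumPy type stubs don't accept `legacy="1.25"`, hence the ignore):

```python
import numpy as np

if np.lib.NumpyVersion(np.__version__) >= "2.0.0":
    # legacy="1.25" is only valid on NumPy >= 2, so the stubs shipped for
    # NumPy 1.x reject it and the call needs a type: ignore.
    np.set_printoptions(legacy="1.25")  # type: ignore[arg-type]
```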
Have we tested these changes against NumPy 2.0?
I've tested it on my local workstation, as written in the PR description. There aren't any CI jobs testing with NumPy 2.0 yet.
To make sure that no calls to code removed in NumPy 2 are made, we could also use ruff in dev/lint-python, since it can check for usage of NumPy 2 deprecations.
(ruff can also be used as a faster replacement of both flake8 and black, but that should be out of scope here)
The change seems fine to me. @codesorcery do you mind creating a PR to set the upper bound in setup.py, like numpy < 2? I think the NumPy release will affect 3.5 users too.
@HyukjinKwon you mean for branches where this PR doesn't get applied? Otherwise, most Python package managers and tools like Renovate won't allow users to update to NumPy 2 with this bound set.
Maybe also of interest: there is a list tracking the compatibility status of Python libraries with NumPy 2.0 at https://github.com/numpy/numpy/issues/26191
Yes, branch-3.5.
@HyukjinKwon here's the PR for branch-3.5 limiting numpy<2: https://github.com/apache/spark/pull/47175 (also auto-linked by GitHub above)
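For context, a hypothetical sketch of what the numpy < 2 upper bound could look like in setup.py; the package name, extras names, and absence of a lower bound are placeholders, not copied from the actual branch-3.5 file:

```python
from setuptools import setup

# Hypothetical, minimal illustration of pinning NumPy below 2.0 for the
# ML-related extras; Spark's real setup.py contains much more metadata.
setup(
    name="example-package",
    extras_require={
        "ml": ["numpy<2"],
        "mllib": ["numpy<2"],
    },
)
```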
It should be possible to write code that is compatible with NumPy 1 & 2. That is what most projects are doing.
I would look over the migration guide. There are more suggestions in the release notes.
As already noted, ruff's NumPy 2 plugin can be a great help in migrating code.
cc @rgommers (for awareness)
Merged to master.
@codesorcery @HyukjinKwon I noticed that a review comment here said np.set_printoptions is used in the tests and can be updated, but this PR uses it in pyspark/core/ rather than in tests. np.set_printoptions changes global state within numpy, and doing that from within another library is a big no-no usually. Could you please consider changing this?
@rgommers np.set_printoptions is only called inside def _test() -> None: in these classes, which set up and run the doctests. It's not called from any function that would be invoked when PySpark is used as a library.
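To illustrate the pattern being described, a simplified sketch of such a module-level doctest runner (the real _test() helpers in PySpark also set up a SparkContext/SparkSession for the doctests; the version check shown here is an assumption):

```python
import doctest
import numpy as np


def _test() -> None:
    # Only relevant on NumPy >= 2, where scalar reprs include type info.
    if np.lib.NumpyVersion(np.__version__) >= "2.0.0":
        # Keep doctest output that was written against the 1.x repr valid.
        np.set_printoptions(legacy="1.25")
    doctest.testmod()


if __name__ == "__main__":
    _test()
```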
Ah there are tests within source files, I missed that. Sorry for the noise!
If there are numpy-related test cases, can we move them out of the rdd module?
IIUC, this problem was already pointed out by @rgommers a month ago in this PR.
Here is a follow-up addressing @HyukjinKwon's comment:
- https://github.com/apache/spark/pull/47526