[SPARK-48710][PYTHON] Use NumPy 2.0 compatible types
What changes were proposed in this pull request?
- Replace NumPy types removed in NumPy 2.0 with their equivalent counterparts
- Make tests compatible with the new `__repr__` of numerical scalars
Why are the changes needed?
PySpark references some symbols that were removed in NumPy 2.0:
- `np.NaN` was removed; it should be replaced with `np.nan`
- `np.string_` was removed; it was an alias for `np.bytes_`
- `np.float_` was removed; it was defined the same as `np.double`
- `np.unicode_` was removed; it was an alias for `np.str_`
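For illustration, a minimal sketch of the replacements listed above (not taken from the PR diff); all of the right-hand names exist in both NumPy 1.x and 2.x, so no version check is needed for them:

```python
import numpy as np

# NumPy 1/2-compatible replacements for the removed aliases:
missing = np.nan        # instead of the removed np.NaN
byte_type = np.bytes_   # instead of the removed np.string_
float_type = np.double  # instead of the removed np.float_
text_type = np.str_     # instead of the removed np.unicode_
```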
NumPy 2.0 changed the `__repr__` of numerical scalars to contain type information (e.g. `np.int32(3)` instead of `3`). The old behavior can be enabled via NumPy's legacy printing mode, e.g. `np.printoptions(legacy="1.25")` (or the older `"1.21"` and `"1.13"` legacy modes). There are multiple tests and doctests that rely on the old behavior.
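To make the behavior change concrete, a small illustration (assumes NumPy >= 2.0 is installed; the `legacy="1.25"` print mode does not exist in older versions):

```python
import numpy as np

x = np.int32(3)
print(repr(x))  # NumPy 2.0: "np.int32(3)"; NumPy 1.x printed just "3"

# The legacy printing mode restores the 1.x-style scalar repr that the
# affected doctests were written against:
with np.printoptions(legacy="1.25"):
    print(repr(x))  # "3"
```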
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Tests for modules pyspark-connect, pyspark-core, pyspark-errors, pyspark-mllib, pyspark-pandas, pyspark-sql, pyspark-resource, pyspark-testing were executed in a local venv with numpy==2.0.0 installed.
Was this patch authored or co-authored using generative AI tooling?
No.
cc @itholic
Let's use the default PR template:
### What changes were proposed in this pull request?
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
### Was this patch authored or co-authored using generative AI tooling?
@itholic thanks for reviewing!
I added some small changes to the tests to make them compatible with NumPy 2.0. The PR description is updated accordingly and formatted according to the default template. I couldn't get the pyspark-ml tests running locally yet, so those are not yet tested with NumPy 2.0.
Maybe it makes sense to update the GitHub jobs to test with both the lowest supported (i.e. 1.21) and latest (i.e. 2.0.0) NumPy version?
Oh, okay, it seems NumPy recently released a new major version (2024-06-17): Release Note.
@HyukjinKwon Maybe should we upgrade the minimum NumPy support to 2.0.0 as we did for Pandas??
Also cc @zhengruifeng, who has worked on a similar PR: https://github.com/apache/spark/pull/42944.
I think that's too aggressive. NumPy is also used in Spark ML and many other dependent projects.
cc @WeichenXu123
Seems fine from a cursory look but let's make the CI happy :-).
CI passed!
There are some linter failures: https://github.com/codesorcery/spark/actions/runs/9708335267/job/26795030107
Added # type: ignore[arg-type] to the affected lines, since legacy="1.25" is only implemented in numpy>=2 and we're checking that the code path is only executed when numpy>=2 is installed.
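A rough sketch of what such a guarded call looks like (the exact version check used in the PR may differ; older NumPy type stubs don't accept `legacy="1.25"`, hence the ignore):

```python
import numpy as np

if np.lib.NumpyVersion(np.__version__) >= "2.0.0":
    # legacy="1.25" is only valid on NumPy >= 2, so the stubs shipped for
    # NumPy 1.x reject it and the call needs a type: ignore.
    np.set_printoptions(legacy="1.25")  # type: ignore[arg-type]
```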
Have we tested these changes against NumPy 2.0?
I've tested it on my local workstation, as written in the PR description. There aren't any CI jobs testing with NumPy 2.0 yet.
To make sure that no calls to code removed in NumPy 2 are made, we could also use ruff in dev/lint-python, since it can check for usage of NumPy 2 deprecations.
(ruff can also be used as a faster replacement of both flake8 and black, but that should be out of scope here)
The change seems fine to me. @codesorcery do you mind creating a PR to set the upper bound in setup.py, like numpy < 2? I think the NumPy release will affect 3.5 users too.
@HyukjinKwon you mean for branches where this PR doesn't get applied? Otherwise, most Python package managers and tools like Renovate won't allow users to update to NumPy 2 with this bound set.
Maybe also of interest: there is a list tracking the compatibility status of Python libraries with NumPy 2.0 at https://github.com/numpy/numpy/issues/26191
Yes, branch-3.5.
@HyukjinKwon here's the PR for branch-3.5 limiting numpy<2: https://github.com/apache/spark/pull/47175 (also auto-linked by GitHub above)
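For context, a hypothetical sketch of what the numpy < 2 upper bound could look like in setup.py; the package name, extras names, and absence of a lower bound are placeholders, not copied from the actual branch-3.5 file:

```python
from setuptools import setup

# Hypothetical, minimal illustration of pinning NumPy below 2.0 for the
# ML-related extras; Spark's real setup.py contains much more metadata.
setup(
    name="example-package",
    extras_require={
        "ml": ["numpy<2"],
        "mllib": ["numpy<2"],
    },
)
```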
It should be possible to write code that is compatible with NumPy 1 & 2. That is what most projects are doing.
I would look over the migration guide. There are more suggestions in the release notes.
As already noted, ruff's NumPy 2 plugin can be a great help in migrating code.
cc @rgommers (for awareness)
Merged to master.
@codesorcery @HyukjinKwon I noticed that a review comment here said np.set_printoptions is used in the tests and can be updated, but this PR uses it in pyspark/core/ rather than in tests. np.set_printoptions changes global state within numpy, and doing that from within another library is a big no-no usually. Could you please consider changing this?
@rgommers np.set_printoptions is only called inside def _test() -> None: in these classes, which set up and run the doctests. It's not called from any function that would be invoked when PySpark is used as a library.
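To illustrate the pattern being described, a simplified sketch of such a module-level doctest runner (the real _test() helpers in PySpark also set up a SparkContext/SparkSession for the doctests; the version check shown here is an assumption):

```python
import doctest
import numpy as np


def _test() -> None:
    # Only relevant on NumPy >= 2, where scalar reprs include type info.
    if np.lib.NumpyVersion(np.__version__) >= "2.0.0":
        # Keep doctest output that was written against the 1.x repr valid.
        np.set_printoptions(legacy="1.25")
    doctest.testmod()


if __name__ == "__main__":
    _test()
```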
Ah there are tests within source files, I missed that. Sorry for the noise!
If there are numpy-related test cases, can we move them out of the rdd module?
IIUC, this problem was already pointed out by @rgommers a month ago in this PR.
Here is a follow-up addressing @HyukjinKwon's comment:
- https://github.com/apache/spark/pull/47526