arrow icon indicating copy to clipboard operation
arrow copied to clipboard

GH-25118: [Python] Make NumPy an optional runtime dependency

Open raulcd opened this issue 1 year ago • 6 comments

Rationale for this change

Being able to run pyarrow without requiring numpy.

What changes are included in this PR?

If numpy is not present we are able to import pyarrow and run functionality. A new CI job has been created to run some basic tests without numpy.

Are these changes tested?

Yes via CI.

Are there any user-facing changes?

Yes, NumPy can be removed from the user installation and pyarrow functionality still works

  • GitHub Issue: #25118

raulcd avatar May 31 '24 10:05 raulcd

@jorisvandenbossche @pitrou As part of this initial PR I've decided to create a new mark pytest.mark.without_numpy (feel free to propose a different name) that will be used to collect tests on the CI job when numpy is not present. The mark won't affect the tests when numpy is present and those tests will also be run. I think we can follow up test refactoring or splitting some test modules (like test_array.py) in order to be able to run the full test suite with and without numpy in a follow up PR instead of trying to add it into this PR. I think this PR is already adding the functionality we are looking for and with the current tests being run and the new CI job we are testing basic functionality and validating compute, dataset, array are working without numpy. I am marking this PR as ready for review now.

raulcd avatar Jun 25 '24 12:06 raulcd

The CI test faiures are unrelated (mainly due to: https://github.com/apache/arrow/issues/42149)

raulcd avatar Jun 25 '24 12:06 raulcd

I'm really -1 on the current testing approach where everything is disabled by default and tests have to be whitelisted twice to be executed.

pitrou avatar Jun 26 '24 13:06 pitrou

tests have to be whitelisted twice

We could also only do the file selection, and for those files remove the marker (which means that either all tests in their have to run without numpy, or be marked with pytest.mark.numpy)

And then we can in follow-ups gradually make that list of files larger until it's all test files and we can simply run pytest without a specific file selection.

jorisvandenbossche avatar Jun 26 '24 13:06 jorisvandenbossche

I'm really -1 on the current testing approach where everything is disabled by default and tests have to be whitelisted twice to be executed.

Thanks @pitrou as pointed on the PR this was just an initial (possible) solution to have some tests being run, I am ok if we want to remove the without_numpy mark but I would still prefer to only run some specific modules.

raulcd avatar Jun 26 '24 13:06 raulcd

I still have to do a couple of things before is ready for review again but after going over all tests we have 190 tests marked as requiring numpy:

~/code/arrow/python  (GH-25118)  $ grep -R "@pytest.mark.numpy" . | wc -l
190

Even though it seems a lot we have 268 for pandas:

~/code/arrow/python  (GH-25118)  $ grep -R "@pytest.mark.pandas" . | wc -l
268

With those changes we are running the following tests without numpy:

2258 passed, 934 skipped, 17 xfailed, 1 xpassed in 136.27s

raulcd avatar Jun 26 '24 16:06 raulcd

@pitrou @jorisvandenbossche I've gone through all the review comments and acted upon them, rebased and fixed some minor related CI issues. Both the jobs with and without numpy scans and run all relevant tests. We only have two marks: numpy and nonumpy which work the same as it does with the pandas and nopandas today. This is ready for a new round of review. Thanks for your help!

raulcd avatar Jul 23 '24 11:07 raulcd

@github-actions crossbow submit -g python

jorisvandenbossche avatar Jul 30 '24 13:07 jorisvandenbossche

Revision: 66120c1ff9afb57d302f14da28bc0c6569abd2a1

Submitted crossbow builds: ursacomputing/crossbow @ actions-78d037afc2

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-cython2 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest-numpy-1.26 GitHub Actions
test-conda-python-3.10-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.10-pandas-nightly-numpy-nightly GitHub Actions
test-conda-python-3.10-substrait GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.8 GitHub Actions
test-conda-python-3.8-pandas-1.0-numpy-1.19 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-latest-numpy-latest GitHub Actions
test-conda-python-emscripten GitHub Actions
test-cuda-python GitHub Actions
test-debian-12-python-3-amd64 GitHub Actions
test-debian-12-python-3-i386 GitHub Actions
test-fedora-39-python-3 GitHub Actions
test-ubuntu-20.04-python-3 GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions

github-actions[bot] avatar Jul 30 '24 13:07 github-actions[bot]

@github-actions crossbow submit test-conda-python-3.11-hypothesis

jorisvandenbossche avatar Aug 01 '24 11:08 jorisvandenbossche

Revision: b9fa212e4364c2c41728a05bb7ee2ff970cf917e

Submitted crossbow builds: ursacomputing/crossbow @ actions-20646cd2d7

Task Status
test-conda-python-3.11-hypothesis GitHub Actions

github-actions[bot] avatar Aug 01 '24 11:08 github-actions[bot]

@github-actions crossbow submit -g python

raulcd avatar Aug 01 '24 13:08 raulcd

Revision: 0e41898236632e04e7fe1f916e771b0a8783a0e4

Submitted crossbow builds: ursacomputing/crossbow @ actions-5d9ffc81f3

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-cython2 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest-numpy-1.26 GitHub Actions
test-conda-python-3.10-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.10-pandas-nightly-numpy-nightly GitHub Actions
test-conda-python-3.10-substrait GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.8 GitHub Actions
test-conda-python-3.8-pandas-1.0-numpy-1.19 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-latest-numpy-latest GitHub Actions
test-conda-python-emscripten GitHub Actions
test-cuda-python GitHub Actions
test-debian-12-python-3-amd64 GitHub Actions
test-debian-12-python-3-i386 GitHub Actions
test-fedora-39-python-3 GitHub Actions
test-ubuntu-20.04-python-3 GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions

github-actions[bot] avatar Aug 01 '24 13:08 github-actions[bot]

CI test failures are unrelated

raulcd avatar Aug 01 '24 14:08 raulcd

@github-actions crossbow submit test-conda-python-3.11-hypothesis

raulcd avatar Aug 02 '24 09:08 raulcd

Revision: 980d2a7dd8ceb722fe9bc2d7d376fef8569750e3

Submitted crossbow builds: ursacomputing/crossbow @ actions-20ce23a13a

Task Status
test-conda-python-3.11-hypothesis GitHub Actions

github-actions[bot] avatar Aug 02 '24 09:08 github-actions[bot]

@pitrou @jorisvandenbossche I think I've gone through all the comments on the PR. I will be off for some days since Thursday, let me know if there's anything else you want me to change before that :)

raulcd avatar Aug 06 '24 16:08 raulcd

CI failures are unrelated. @jorisvandenbossche @pitrou I've gone through the last comments again. Sorry, I had to rebase in order to fix some issues. The last commits, from https://github.com/apache/arrow/pull/41904/commits/a566aeca944707f31f0ad93b06651948aca88174 onwards, are the new commits.

raulcd avatar Aug 26 '24 17:08 raulcd

All good from my side!

While this PR does make it possible to use pyarrow at runtime without numpy installed, it does not actually makes the dependency optional, in a standard packaging sense. Do we want to discuss that in a follow-up issue whether we also want to actually remove numpy from the required dependencies list?

jorisvandenbossche avatar Aug 27 '24 13:08 jorisvandenbossche

Do we want to discuss that in a follow-up issue whether we also want to actually remove numpy from the required dependencies list?

I was planning to create a follow up issue where we remove numpy from the required dependencies list. I've just created: https://github.com/apache/arrow/issues/43846

raulcd avatar Aug 27 '24 13:08 raulcd

I've rebased again to keep it up to date with main. @pitrou do you have any other comments / concerns before merging?

raulcd avatar Aug 30 '24 08:08 raulcd

@github-actions crossbow submit -g wheel -g python

pitrou avatar Sep 02 '24 13:09 pitrou

Revision: 08da867e92a82ddb20f9f73ac52b7c69dfb8b041

Submitted crossbow builds: ursacomputing/crossbow @ actions-e575e4ce7c

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-cython2 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest-numpy-1.26 GitHub Actions
test-conda-python-3.10-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.10-pandas-nightly-numpy-nightly GitHub Actions
test-conda-python-3.10-substrait GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.12-cpython-debug GitHub Actions
test-conda-python-3.8 GitHub Actions
test-conda-python-3.8-pandas-1.0-numpy-1.19 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-latest-numpy-latest GitHub Actions
test-conda-python-emscripten GitHub Actions
test-cuda-python GitHub Actions
test-debian-12-python-3-amd64 GitHub Actions
test-debian-12-python-3-i386 GitHub Actions
test-fedora-39-python-3 GitHub Actions
test-ubuntu-20.04-python-3 GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
wheel-macos-monterey-cp310-amd64 GitHub Actions
wheel-macos-monterey-cp310-arm64 GitHub Actions
wheel-macos-monterey-cp311-amd64 GitHub Actions
wheel-macos-monterey-cp311-arm64 GitHub Actions
wheel-macos-monterey-cp312-amd64 GitHub Actions
wheel-macos-monterey-cp312-arm64 GitHub Actions
wheel-macos-monterey-cp313-amd64 GitHub Actions
wheel-macos-monterey-cp313-arm64 GitHub Actions
wheel-macos-monterey-cp38-amd64 GitHub Actions
wheel-macos-monterey-cp38-arm64 GitHub Actions
wheel-macos-monterey-cp39-amd64 GitHub Actions
wheel-macos-monterey-cp39-arm64 GitHub Actions
wheel-manylinux-2-28-cp310-amd64 GitHub Actions
wheel-manylinux-2-28-cp310-arm64 GitHub Actions
wheel-manylinux-2-28-cp311-amd64 GitHub Actions
wheel-manylinux-2-28-cp311-arm64 GitHub Actions
wheel-manylinux-2-28-cp312-amd64 GitHub Actions
wheel-manylinux-2-28-cp312-arm64 GitHub Actions
wheel-manylinux-2-28-cp313-amd64 GitHub Actions
wheel-manylinux-2-28-cp313-arm64 GitHub Actions
wheel-manylinux-2-28-cp38-amd64 GitHub Actions
wheel-manylinux-2-28-cp38-arm64 GitHub Actions
wheel-manylinux-2-28-cp39-amd64 GitHub Actions
wheel-manylinux-2-28-cp39-arm64 GitHub Actions
wheel-manylinux-2014-cp310-amd64 GitHub Actions
wheel-manylinux-2014-cp310-arm64 GitHub Actions
wheel-manylinux-2014-cp311-amd64 GitHub Actions
wheel-manylinux-2014-cp311-arm64 GitHub Actions
wheel-manylinux-2014-cp312-amd64 GitHub Actions
wheel-manylinux-2014-cp312-arm64 GitHub Actions
wheel-manylinux-2014-cp313-amd64 GitHub Actions
wheel-manylinux-2014-cp313-arm64 GitHub Actions
wheel-manylinux-2014-cp38-amd64 GitHub Actions
wheel-manylinux-2014-cp38-arm64 GitHub Actions
wheel-manylinux-2014-cp39-amd64 GitHub Actions
wheel-manylinux-2014-cp39-arm64 GitHub Actions
wheel-windows-cp310-amd64 GitHub Actions
wheel-windows-cp311-amd64 GitHub Actions
wheel-windows-cp312-amd64 GitHub Actions
wheel-windows-cp313-amd64 GitHub Actions
wheel-windows-cp38-amd64 GitHub Actions
wheel-windows-cp39-amd64 GitHub Actions

github-actions[bot] avatar Sep 02 '24 13:09 github-actions[bot]

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 9ab9532a208d5632b0f8b5035a207235b5e6b828.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

Thanks @raulcd!

jorisvandenbossche avatar Sep 03 '24 08:09 jorisvandenbossche