arrow GH-25118: [Python] Make NumPy an optional runtime dependency

Rationale for this change

Being able to run pyarrow without requiring numpy.

What changes are included in this PR?

If numpy is not present we are able to import pyarrow and run functionality. A new CI job has been created to run some basic tests without numpy.

Are these changes tested?

Yes via CI.

Are there any user-facing changes?

Yes, NumPy can be removed from the user installation and pyarrow functionality still works

GitHub Issue: #25118

May 31 '24 10:05 raulcd

@jorisvandenbossche @pitrou As part of this initial PR I've decided to create a new mark pytest.mark.without_numpy (feel free to propose a different name) that will be used to collect tests on the CI job when numpy is not present. The mark won't affect the tests when numpy is present and those tests will also be run. I think we can follow up test refactoring or splitting some test modules (like test_array.py) in order to be able to run the full test suite with and without numpy in a follow up PR instead of trying to add it into this PR. I think this PR is already adding the functionality we are looking for and with the current tests being run and the new CI job we are testing basic functionality and validating compute, dataset, array are working without numpy. I am marking this PR as ready for review now.

Jun 25 '24 12:06 raulcd

The CI test faiures are unrelated (mainly due to: https://github.com/apache/arrow/issues/42149)

Jun 25 '24 12:06 raulcd

I'm really -1 on the current testing approach where everything is disabled by default and tests have to be whitelisted twice to be executed.

Jun 26 '24 13:06 pitrou

tests have to be whitelisted twice

We could also only do the file selection, and for those files remove the marker (which means that either all tests in their have to run without numpy, or be marked with pytest.mark.numpy)

And then we can in follow-ups gradually make that list of files larger until it's all test files and we can simply run pytest without a specific file selection.

Jun 26 '24 13:06 jorisvandenbossche

I'm really -1 on the current testing approach where everything is disabled by default and tests have to be whitelisted twice to be executed.

Thanks @pitrou as pointed on the PR this was just an initial (possible) solution to have some tests being run, I am ok if we want to remove the without_numpy mark but I would still prefer to only run some specific modules.

Jun 26 '24 13:06 raulcd

I still have to do a couple of things before is ready for review again but after going over all tests we have 190 tests marked as requiring numpy:

~/code/arrow/python  (GH-25118)  $ grep -R "@pytest.mark.numpy" . | wc -l
190

Even though it seems a lot we have 268 for pandas:

~/code/arrow/python  (GH-25118)  $ grep -R "@pytest.mark.pandas" . | wc -l
268

With those changes we are running the following tests without numpy:

2258 passed, 934 skipped, 17 xfailed, 1 xpassed in 136.27s

Jun 26 '24 16:06 raulcd

@pitrou @jorisvandenbossche I've gone through all the review comments and acted upon them, rebased and fixed some minor related CI issues. Both the jobs with and without numpy scans and run all relevant tests. We only have two marks: numpy and nonumpy which work the same as it does with the pandas and nopandas today. This is ready for a new round of review. Thanks for your help!

Jul 23 '24 11:07 raulcd

@github-actions crossbow submit -g python

Jul 30 '24 13:07 jorisvandenbossche

Revision: 66120c1ff9afb57d302f14da28bc0c6569abd2a1

Submitted crossbow builds: ursacomputing/crossbow @ actions-78d037afc2

Task	Status
example-python-minimal-build-fedora-conda
example-python-minimal-build-ubuntu-venv
test-conda-python-3.10
test-conda-python-3.10-cython2
test-conda-python-3.10-hdfs-2.9.2
test-conda-python-3.10-hdfs-3.2.1
test-conda-python-3.10-pandas-latest-numpy-1.26
test-conda-python-3.10-pandas-latest-numpy-latest
test-conda-python-3.10-pandas-nightly-numpy-nightly
test-conda-python-3.10-substrait
test-conda-python-3.11
test-conda-python-3.11-dask-latest
test-conda-python-3.11-dask-upstream_devel
test-conda-python-3.11-hypothesis
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly
test-conda-python-3.11-spark-master
test-conda-python-3.12
test-conda-python-3.8
test-conda-python-3.8-pandas-1.0-numpy-1.19
test-conda-python-3.9
test-conda-python-3.9-pandas-latest-numpy-latest
test-conda-python-emscripten
test-cuda-python
test-debian-12-python-3-amd64
test-debian-12-python-3-i386
test-fedora-39-python-3
test-ubuntu-20.04-python-3
test-ubuntu-22.04-python-3

Jul 30 '24 13:07 github-actions[bot]

@github-actions crossbow submit test-conda-python-3.11-hypothesis

Aug 01 '24 11:08 jorisvandenbossche

Revision: b9fa212e4364c2c41728a05bb7ee2ff970cf917e

Submitted crossbow builds: ursacomputing/crossbow @ actions-20646cd2d7

Task	Status
test-conda-python-3.11-hypothesis

Aug 01 '24 11:08 github-actions[bot]

@github-actions crossbow submit -g python

Aug 01 '24 13:08 raulcd

Revision: 0e41898236632e04e7fe1f916e771b0a8783a0e4

Submitted crossbow builds: ursacomputing/crossbow @ actions-5d9ffc81f3

Task	Status
example-python-minimal-build-fedora-conda
example-python-minimal-build-ubuntu-venv
test-conda-python-3.10
test-conda-python-3.10-cython2
test-conda-python-3.10-hdfs-2.9.2
test-conda-python-3.10-hdfs-3.2.1
test-conda-python-3.10-pandas-latest-numpy-1.26
test-conda-python-3.10-pandas-latest-numpy-latest
test-conda-python-3.10-pandas-nightly-numpy-nightly
test-conda-python-3.10-substrait
test-conda-python-3.11
test-conda-python-3.11-dask-latest
test-conda-python-3.11-dask-upstream_devel
test-conda-python-3.11-hypothesis
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly
test-conda-python-3.11-spark-master
test-conda-python-3.12
test-conda-python-3.8
test-conda-python-3.8-pandas-1.0-numpy-1.19
test-conda-python-3.9
test-conda-python-3.9-pandas-latest-numpy-latest
test-conda-python-emscripten
test-cuda-python
test-debian-12-python-3-amd64
test-debian-12-python-3-i386
test-fedora-39-python-3
test-ubuntu-20.04-python-3
test-ubuntu-22.04-python-3

Aug 01 '24 13:08 github-actions[bot]

CI test failures are unrelated

Aug 01 '24 14:08 raulcd

@github-actions crossbow submit test-conda-python-3.11-hypothesis

Aug 02 '24 09:08 raulcd

Revision: 980d2a7dd8ceb722fe9bc2d7d376fef8569750e3

Submitted crossbow builds: ursacomputing/crossbow @ actions-20ce23a13a

Task	Status
test-conda-python-3.11-hypothesis

Aug 02 '24 09:08 github-actions[bot]

@pitrou @jorisvandenbossche I think I've gone through all the comments on the PR. I will be off for some days since Thursday, let me know if there's anything else you want me to change before that :)

Aug 06 '24 16:08 raulcd

CI failures are unrelated. @jorisvandenbossche @pitrou I've gone through the last comments again. Sorry, I had to rebase in order to fix some issues. The last commits, from https://github.com/apache/arrow/pull/41904/commits/a566aeca944707f31f0ad93b06651948aca88174 onwards, are the new commits.

Aug 26 '24 17:08 raulcd

All good from my side!

While this PR does make it possible to use pyarrow at runtime without numpy installed, it does not actually makes the dependency optional, in a standard packaging sense. Do we want to discuss that in a follow-up issue whether we also want to actually remove numpy from the required dependencies list?

Aug 27 '24 13:08 jorisvandenbossche

Do we want to discuss that in a follow-up issue whether we also want to actually remove numpy from the required dependencies list?

I was planning to create a follow up issue where we remove numpy from the required dependencies list. I've just created: https://github.com/apache/arrow/issues/43846

Aug 27 '24 13:08 raulcd

I've rebased again to keep it up to date with main. @pitrou do you have any other comments / concerns before merging?

Aug 30 '24 08:08 raulcd

@github-actions crossbow submit -g wheel -g python

Sep 02 '24 13:09 pitrou

Revision: 08da867e92a82ddb20f9f73ac52b7c69dfb8b041

Submitted crossbow builds: ursacomputing/crossbow @ actions-e575e4ce7c

Task	Status
example-python-minimal-build-fedora-conda
example-python-minimal-build-ubuntu-venv
test-conda-python-3.10
test-conda-python-3.10-cython2
test-conda-python-3.10-hdfs-2.9.2
test-conda-python-3.10-hdfs-3.2.1
test-conda-python-3.10-pandas-latest-numpy-1.26
test-conda-python-3.10-pandas-latest-numpy-latest
test-conda-python-3.10-pandas-nightly-numpy-nightly
test-conda-python-3.10-substrait
test-conda-python-3.11
test-conda-python-3.11-dask-latest
test-conda-python-3.11-dask-upstream_devel
test-conda-python-3.11-hypothesis
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly
test-conda-python-3.11-spark-master
test-conda-python-3.12
test-conda-python-3.12-cpython-debug
test-conda-python-3.8
test-conda-python-3.8-pandas-1.0-numpy-1.19
test-conda-python-3.9
test-conda-python-3.9-pandas-latest-numpy-latest
test-conda-python-emscripten
test-cuda-python
test-debian-12-python-3-amd64
test-debian-12-python-3-i386
test-fedora-39-python-3
test-ubuntu-20.04-python-3
test-ubuntu-22.04-python-3
wheel-macos-monterey-cp310-amd64
wheel-macos-monterey-cp310-arm64
wheel-macos-monterey-cp311-amd64
wheel-macos-monterey-cp311-arm64
wheel-macos-monterey-cp312-amd64
wheel-macos-monterey-cp312-arm64
wheel-macos-monterey-cp313-amd64
wheel-macos-monterey-cp313-arm64
wheel-macos-monterey-cp38-amd64
wheel-macos-monterey-cp38-arm64
wheel-macos-monterey-cp39-amd64
wheel-macos-monterey-cp39-arm64
wheel-manylinux-2-28-cp310-amd64
wheel-manylinux-2-28-cp310-arm64
wheel-manylinux-2-28-cp311-amd64
wheel-manylinux-2-28-cp311-arm64
wheel-manylinux-2-28-cp312-amd64
wheel-manylinux-2-28-cp312-arm64
wheel-manylinux-2-28-cp313-amd64
wheel-manylinux-2-28-cp313-arm64
wheel-manylinux-2-28-cp38-amd64
wheel-manylinux-2-28-cp38-arm64
wheel-manylinux-2-28-cp39-amd64
wheel-manylinux-2-28-cp39-arm64
wheel-manylinux-2014-cp310-amd64
wheel-manylinux-2014-cp310-arm64
wheel-manylinux-2014-cp311-amd64
wheel-manylinux-2014-cp311-arm64
wheel-manylinux-2014-cp312-amd64
wheel-manylinux-2014-cp312-arm64
wheel-manylinux-2014-cp313-amd64
wheel-manylinux-2014-cp313-arm64
wheel-manylinux-2014-cp38-amd64
wheel-manylinux-2014-cp38-arm64
wheel-manylinux-2014-cp39-amd64
wheel-manylinux-2014-cp39-arm64
wheel-windows-cp310-amd64
wheel-windows-cp311-amd64
wheel-windows-cp312-amd64
wheel-windows-cp313-amd64
wheel-windows-cp38-amd64
wheel-windows-cp39-amd64

Sep 02 '24 13:09 github-actions[bot]

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 9ab9532a208d5632b0f8b5035a207235b5e6b828.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.

Sep 02 '24 22:09 conbench-apache-arrow[bot]

Thanks @raulcd!

Sep 03 '24 08:09 jorisvandenbossche