GH-25118: [Python] Make NumPy an optional runtime dependency
Rationale for this change
Being able to run pyarrow without requiring numpy.
What changes are included in this PR?
If numpy is not present we are able to import pyarrow and run functionality. A new CI job has been created to run some basic tests without numpy.
Are these changes tested?
Yes via CI.
Are there any user-facing changes?
Yes, NumPy can be removed from the user installation and pyarrow functionality still works
- GitHub Issue: #25118
@jorisvandenbossche @pitrou As part of this initial PR I've decided to create a new mark pytest.mark.without_numpy (feel free to propose a different name) that will be used to collect tests on the CI job when numpy is not present. The mark won't affect the tests when numpy is present and those tests will also be run.
I think we can follow up test refactoring or splitting some test modules (like test_array.py) in order to be able to run the full test suite with and without numpy in a follow up PR instead of trying to add it into this PR. I think this PR is already adding the functionality we are looking for and with the current tests being run and the new CI job we are testing basic functionality and validating compute, dataset, array are working without numpy.
I am marking this PR as ready for review now.
The CI test faiures are unrelated (mainly due to: https://github.com/apache/arrow/issues/42149)
I'm really -1 on the current testing approach where everything is disabled by default and tests have to be whitelisted twice to be executed.
tests have to be whitelisted twice
We could also only do the file selection, and for those files remove the marker (which means that either all tests in their have to run without numpy, or be marked with pytest.mark.numpy)
And then we can in follow-ups gradually make that list of files larger until it's all test files and we can simply run pytest without a specific file selection.
I'm really -1 on the current testing approach where everything is disabled by default and tests have to be whitelisted twice to be executed.
Thanks @pitrou as pointed on the PR this was just an initial (possible) solution to have some tests being run, I am ok if we want to remove the without_numpy mark but I would still prefer to only run some specific modules.
I still have to do a couple of things before is ready for review again but after going over all tests we have 190 tests marked as requiring numpy:
~/code/arrow/python (GH-25118) $ grep -R "@pytest.mark.numpy" . | wc -l
190
Even though it seems a lot we have 268 for pandas:
~/code/arrow/python (GH-25118) $ grep -R "@pytest.mark.pandas" . | wc -l
268
With those changes we are running the following tests without numpy:
2258 passed, 934 skipped, 17 xfailed, 1 xpassed in 136.27s
@pitrou @jorisvandenbossche I've gone through all the review comments and acted upon them, rebased and fixed some minor related CI issues.
Both the jobs with and without numpy scans and run all relevant tests. We only have two marks: numpy and nonumpy which work the same as it does with the pandas and nopandas today.
This is ready for a new round of review.
Thanks for your help!
@github-actions crossbow submit -g python
Revision: 66120c1ff9afb57d302f14da28bc0c6569abd2a1
Submitted crossbow builds: ursacomputing/crossbow @ actions-78d037afc2
@github-actions crossbow submit test-conda-python-3.11-hypothesis
Revision: b9fa212e4364c2c41728a05bb7ee2ff970cf917e
Submitted crossbow builds: ursacomputing/crossbow @ actions-20646cd2d7
| Task | Status |
|---|---|
| test-conda-python-3.11-hypothesis |
@github-actions crossbow submit -g python
Revision: 0e41898236632e04e7fe1f916e771b0a8783a0e4
Submitted crossbow builds: ursacomputing/crossbow @ actions-5d9ffc81f3
CI test failures are unrelated
@github-actions crossbow submit test-conda-python-3.11-hypothesis
Revision: 980d2a7dd8ceb722fe9bc2d7d376fef8569750e3
Submitted crossbow builds: ursacomputing/crossbow @ actions-20ce23a13a
| Task | Status |
|---|---|
| test-conda-python-3.11-hypothesis |
@pitrou @jorisvandenbossche I think I've gone through all the comments on the PR. I will be off for some days since Thursday, let me know if there's anything else you want me to change before that :)
CI failures are unrelated. @jorisvandenbossche @pitrou I've gone through the last comments again. Sorry, I had to rebase in order to fix some issues. The last commits, from https://github.com/apache/arrow/pull/41904/commits/a566aeca944707f31f0ad93b06651948aca88174 onwards, are the new commits.
All good from my side!
While this PR does make it possible to use pyarrow at runtime without numpy installed, it does not actually makes the dependency optional, in a standard packaging sense. Do we want to discuss that in a follow-up issue whether we also want to actually remove numpy from the required dependencies list?
Do we want to discuss that in a follow-up issue whether we also want to actually remove numpy from the required dependencies list?
I was planning to create a follow up issue where we remove numpy from the required dependencies list. I've just created: https://github.com/apache/arrow/issues/43846
I've rebased again to keep it up to date with main. @pitrou do you have any other comments / concerns before merging?
@github-actions crossbow submit -g wheel -g python
Revision: 08da867e92a82ddb20f9f73ac52b7c69dfb8b041
Submitted crossbow builds: ursacomputing/crossbow @ actions-e575e4ce7c
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 9ab9532a208d5632b0f8b5035a207235b5e6b828.
There were no benchmark performance regressions. 🎉
The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.
Thanks @raulcd!