Unblock NumPy 2.0
Fixes https://github.com/huggingface/datasets/issues/6980
@albertvillanova Any chance we could get this in before the next release? Everything that depends on Hugging Face libraries has its NumPy upgrade blocked.
The incompatible libraries are:
- faiss-cpu 1.8.0.post1 requires numpy<2.0,>=1.0, but you have numpy 2.0.0 which is incompatible.
- tensorflow 2.16.2 requires numpy<2.0.0,>=1.23.5; python_version <= "3.11", but you have numpy 2.0.0 which is incompatible.
- transformers 4.42.3 requires numpy<2.0,>=1.17, but you have numpy 2.0.0 which is incompatible.
Why is it installing numpy 2 if the dependencies don't support it?
For me, I'm getting:
```
❯ uv pip install --system "datasets[tests] @ ."
Found existing alias for "uv pip install". You should use: "pipi"
Resolved 119 packages in 934ms
Built datasets @ file:///Users/neil/src/datasets
Prepared 1 package in 1.28s
Uninstalled 1 package in 10ms
Installed 2 packages in 17ms
- datasets==2.20.1.dev0 (from file:///Users/neil/src/datasets)
+ datasets==2.20.1.dev0 (from file:///Users/neil/src/datasets)
+ numpy==1.26.4
```
Which version of Python do you have?
3.12.4. I'll try 3.10 now.
Please note that I obtained the incompatible libraries above in my local environment by forcing the numpy upgrade.
In the Python 3.10 CI, the situation is different:
- for example, it installs an older version of tensorflow (2.14.0), which probably did not yet have the numpy<2.0 constraint. See the details: https://github.com/huggingface/datasets/actions/runs/9879100332/job/27306903343?pr=6991
```
> uv pip install --system "datasets[tests] @ ."
...
+ faiss-cpu==1.8.0
...
+ numpy==2.0.0
...
+ tensorflow==2.14.0
```
See, CI installs:
- faiss-cpu 1.8.0 instead of 1.8.0.post1
- tensorflow 2.14.0 instead of 2.16.2
- transformers 4.41.2 instead of 4.42.3
~~The main point is that we cannot support numpy 2.0 until tensorflow and faiss do.~~
Alternatively, we should ignore/select tests depending on the installed versions.
> Alternatively, we should ignore/select tests depending on the installed versions.

That works.
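A minimal sketch of what such version-gated skipping could look like (hypothetical helper names, not the PR's actual code):

```python
# Hypothetical sketch: skip tests whose optional dependencies still pin numpy<2.0.
import numpy as np
import pytest
from packaging.version import parse

NUMPY_2 = parse(np.__version__) >= parse("2.0.0")

require_numpy1 = pytest.mark.skipif(
    NUMPY_2,
    reason="tensorflow and faiss-cpu do not support NumPy 2.0 yet",
)

@require_numpy1
def test_tf_constant_roundtrip():
    import tensorflow as tf  # only imported when NumPy 1.x is installed

    assert tf.constant([1, 2]).numpy().sum() == 3
```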
Alternatively, you could depend on tensorflow >= 2.16.2 (etc.) for the tests?
Yes, I was thinking of a workaround solution.
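For reference, that alternative could look like this in the package's tests extra (a sketch using the version numbers from the conflicts above, not the PR's actual diff):

```python
# Hypothetical sketch: pin the tests extra to the latest releases named in this
# thread, so the resolver installs a consistent numpy<2.0 environment instead
# of pairing an old tensorflow with numpy 2.0.
TESTS_REQUIRE = [
    "faiss-cpu>=1.8.0.post1",
    "tensorflow>=2.16.2",
    "transformers>=4.42.3",
]
```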
The issue I see is that our CI would then not test numpy 2.0.

> The issue I see is that our CI would then not test numpy 2.0.

Right, that's the advantage of the test skipping you wanted; I see your point.
Thing is, it won't be long before tensorflow supports numpy 2.0, and then the situation resolves itself and your tests cover numpy 2.0. Do you really want to invest a lot of effort into testing numpy 2.0 for a few months' benefit?
Without testing NumPy 2.0, we do not know whether other parts of the code are broken.

> Without testing NumPy 2.0, we do not know whether other parts of the code are broken.

Yes, you're right. I understand your point, but you could say this about anything your test dependencies don't support.
I guess the solution is to write tests that don't depend on tensorflow, etc., but still use numpy. You could write some JAX tests, for example.
That said, blocking numpy 2 isn't a good solution in my opinion. These dependencies are extremely late in supporting Numpy 2. They were supposed to be testing against preview releases over three months ago. I don't think the world should have to wait for them.
> I guess the solution is to write tests that don't depend on tensorflow, etc., but still use numpy.

That is my point. What we cannot do is blindly claim support for NumPy 2.0 without knowing the consequences. We need to test it (see the sketch after this list):
- to know whether our core code works with it
- to know which optional libraries are incompatible
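As a rough illustration, a test along those lines might exercise the numpy formatting path directly (a hypothetical test, not part of this PR):

```python
# Hypothetical sketch: exercise datasets' numpy formatting without importing
# tensorflow, faiss, or other libraries that still pin numpy<2.0.
import numpy as np
from datasets import Dataset

def test_numpy_formatting_roundtrip():
    ds = Dataset.from_dict({"x": [[1.0, 2.0], [3.0, 4.0]]}).with_format("numpy")
    batch = ds[:2]["x"]
    assert isinstance(batch, np.ndarray)
    assert batch.shape == (2, 2)
```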
For example, while testing locally, I discovered that librosa is also incompatible with numpy 2.0, due to its dependency on soxr:
- https://github.com/dofuuz/python-soxr/issues/28
While testing locally, I also discovered that pytorch does not support NumPy 2.0 on Windows:
- https://github.com/pytorch/pytorch/issues/128860
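A platform-conditional skip along the same lines could look like this (hypothetical marker name, not necessarily the PR's actual code):

```python
# Hypothetical sketch: gate torch-dependent tests on Windows, where pytorch
# lacks NumPy 2.0 support per the issue linked above.
import sys

import numpy as np
import pytest
from packaging.version import parse

require_numpy1_on_windows = pytest.mark.skipif(
    parse(np.__version__) >= parse("2.0.0") and sys.platform == "win32",
    reason="pytorch does not support NumPy 2.0 on Windows",
)
```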
If you don't mind, I am adding NumPy 2.0 tests to your PR before merging.
Awesome, thank you! Please let me know if I need to do anything.
Now we test numpy 2.0 in the test_py310_numpy2 CI tests: https://github.com/huggingface/datasets/actions/runs/9907254874/job/27370545495?pr=6991
```
+ numpy==2.0.0
```
Benchmarks (PyArrow==8.0.0):
Benchmark: benchmark_array_xd.json
| metric | read_batch_formatted_as_numpy after write_array2d | read_batch_formatted_as_numpy after write_flattened_sequence | read_batch_formatted_as_numpy after write_nested_sequence | read_batch_unformated after write_array2d | read_batch_unformated after write_flattened_sequence | read_batch_unformated after write_nested_sequence | read_col_formatted_as_numpy after write_array2d | read_col_formatted_as_numpy after write_flattened_sequence | read_col_formatted_as_numpy after write_nested_sequence | read_col_unformated after write_array2d | read_col_unformated after write_flattened_sequence | read_col_unformated after write_nested_sequence | read_formatted_as_numpy after write_array2d | read_formatted_as_numpy after write_flattened_sequence | read_formatted_as_numpy after write_nested_sequence | read_unformated after write_array2d | read_unformated after write_flattened_sequence | read_unformated after write_nested_sequence | write_array2d | write_flattened_sequence | write_nested_sequence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.005709 / 0.011353 (-0.005643) | 0.003947 / 0.011008 (-0.007061) | 0.064407 / 0.038508 (0.025899) | 0.029903 / 0.023109 (0.006794) | 0.244838 / 0.275898 (-0.031060) | 0.268894 / 0.323480 (-0.054586) | 0.003200 / 0.007986 (-0.004786) | 0.002867 / 0.004328 (-0.001461) | 0.050016 / 0.004250 (0.045765) | 0.047682 / 0.037052 (0.010629) | 0.252186 / 0.258489 (-0.006303) | 0.292050 / 0.293841 (-0.001791) | 0.030277 / 0.128546 (-0.098270) | 0.012283 / 0.075646 (-0.063364) | 0.205875 / 0.419271 (-0.213397) | 0.037202 / 0.043533 (-0.006331) | 0.246045 / 0.255139 (-0.009094) | 0.272422 / 0.283200 (-0.010777) | 0.020572 / 0.141683 (-0.121111) | 1.114343 / 1.452155 (-0.337812) | 1.169909 / 1.492716 (-0.322808) |
Benchmark: benchmark_getitem_100B.json
| metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row |
|---|---|---|---|---|
| new / old (diff) | 0.096612 / 0.018006 (0.078605) | 0.303025 / 0.000490 (0.302535) | 0.000210 / 0.000200 (0.000010) | 0.000043 / 0.000054 (-0.000011) |
Benchmark: benchmark_indices_mapping.json
| metric | select | shard | shuffle | sort | train_test_split |
|---|---|---|---|---|---|
| new / old (diff) | 0.019292 / 0.037411 (-0.018119) | 0.062548 / 0.014526 (0.048023) | 0.076027 / 0.176557 (-0.100530) | 0.121752 / 0.737135 (-0.615383) | 0.076608 / 0.296338 (-0.219730) |
Benchmark: benchmark_iterating.json
| metric | read 5000 | read 50000 | read_batch 50000 10 | read_batch 50000 100 | read_batch 50000 1000 | read_formatted numpy 5000 | read_formatted pandas 5000 | read_formatted tensorflow 5000 | read_formatted torch 5000 | read_formatted_batch numpy 5000 10 | read_formatted_batch numpy 5000 1000 | shuffled read 5000 | shuffled read 50000 | shuffled read_batch 50000 10 | shuffled read_batch 50000 100 | shuffled read_batch 50000 1000 | shuffled read_formatted numpy 5000 | shuffled read_formatted_batch numpy 5000 10 | shuffled read_formatted_batch numpy 5000 1000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.283900 / 0.215209 (0.068691) | 2.829829 / 2.077655 (0.752174) | 1.428934 / 1.504120 (-0.075186) | 1.316796 / 1.541195 (-0.224399) | 1.330012 / 1.468490 (-0.138478) | 0.702245 / 4.584777 (-3.882532) | 2.380454 / 3.745712 (-1.365259) | 2.882881 / 5.269862 (-2.386980) | 1.920345 / 4.565676 (-2.645332) | 0.077860 / 0.424275 (-0.346415) | 0.005295 / 0.007607 (-0.002312) | 0.336968 / 0.226044 (0.110924) | 3.327808 / 2.268929 (1.058879) | 1.781958 / 55.444624 (-53.662666) | 1.489412 / 6.876477 (-5.387065) | 1.634829 / 2.142072 (-0.507243) | 0.787985 / 4.805227 (-4.017243) | 0.134397 / 6.500664 (-6.366267) | 0.042906 / 0.075469 (-0.032563) |
Benchmark: benchmark_map_filter.json
| metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow |
|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.967647 / 1.841788 (-0.874141) | 11.714541 / 8.074308 (3.640233) | 9.350228 / 10.191392 (-0.841164) | 0.142675 / 0.680424 (-0.537749) | 0.014609 / 0.534201 (-0.519592) | 0.301970 / 0.579283 (-0.277314) | 0.262350 / 0.434364 (-0.172014) | 0.342933 / 0.540337 (-0.197404) | 0.437321 / 1.386936 (-0.949615) |
Updated benchmarks:
Benchmark: benchmark_array_xd.json
| metric | read_batch_formatted_as_numpy after write_array2d | read_batch_formatted_as_numpy after write_flattened_sequence | read_batch_formatted_as_numpy after write_nested_sequence | read_batch_unformated after write_array2d | read_batch_unformated after write_flattened_sequence | read_batch_unformated after write_nested_sequence | read_col_formatted_as_numpy after write_array2d | read_col_formatted_as_numpy after write_flattened_sequence | read_col_formatted_as_numpy after write_nested_sequence | read_col_unformated after write_array2d | read_col_unformated after write_flattened_sequence | read_col_unformated after write_nested_sequence | read_formatted_as_numpy after write_array2d | read_formatted_as_numpy after write_flattened_sequence | read_formatted_as_numpy after write_nested_sequence | read_unformated after write_array2d | read_unformated after write_flattened_sequence | read_unformated after write_nested_sequence | write_array2d | write_flattened_sequence | write_nested_sequence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.005622 / 0.011353 (-0.005731) | 0.003958 / 0.011008 (-0.007050) | 0.050667 / 0.038508 (0.012159) | 0.032842 / 0.023109 (0.009733) | 0.252292 / 0.275898 (-0.023606) | 0.280602 / 0.323480 (-0.042878) | 0.004313 / 0.007986 (-0.003673) | 0.002870 / 0.004328 (-0.001458) | 0.049549 / 0.004250 (0.045299) | 0.040448 / 0.037052 (0.003396) | 0.270264 / 0.258489 (0.011775) | 0.302988 / 0.293841 (0.009147) | 0.030840 / 0.128546 (-0.097707) | 0.012131 / 0.075646 (-0.063515) | 0.060061 / 0.419271 (-0.359211) | 0.033025 / 0.043533 (-0.010507) | 0.251909 / 0.255139 (-0.003230) | 0.275511 / 0.283200 (-0.007689) | 0.018399 / 0.141683 (-0.123284) | 1.160744 / 1.452155 (-0.291411) | 1.188265 / 1.492716 (-0.304452) |
Benchmark: benchmark_getitem_100B.json
| metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row |
|---|---|---|---|---|
| new / old (diff) | 0.097719 / 0.018006 (0.079712) | 0.304389 / 0.000490 (0.303899) | 0.000217 / 0.000200 (0.000017) | 0.000045 / 0.000054 (-0.000010) |
Benchmark: benchmark_indices_mapping.json
| metric | select | shard | shuffle | sort | train_test_split |
|---|---|---|---|---|---|
| new / old (diff) | 0.022964 / 0.037411 (-0.014447) | 0.076897 / 0.014526 (0.062372) | 0.088930 / 0.176557 (-0.087626) | 0.128926 / 0.737135 (-0.608209) | 0.091049 / 0.296338 (-0.205290) |
Benchmark: benchmark_iterating.json
| metric | read 5000 | read 50000 | read_batch 50000 10 | read_batch 50000 100 | read_batch 50000 1000 | read_formatted numpy 5000 | read_formatted pandas 5000 | read_formatted tensorflow 5000 | read_formatted torch 5000 | read_formatted_batch numpy 5000 10 | read_formatted_batch numpy 5000 1000 | shuffled read 5000 | shuffled read 50000 | shuffled read_batch 50000 10 | shuffled read_batch 50000 100 | shuffled read_batch 50000 1000 | shuffled read_formatted numpy 5000 | shuffled read_formatted_batch numpy 5000 10 | shuffled read_formatted_batch numpy 5000 1000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.285670 / 0.215209 (0.070461) | 2.806071 / 2.077655 (0.728416) | 1.527161 / 1.504120 (0.023041) | 1.410291 / 1.541195 (-0.130903) | 1.427071 / 1.468490 (-0.041419) | 0.705527 / 4.584777 (-3.879250) | 0.926915 / 3.745712 (-2.818797) | 2.893078 / 5.269862 (-2.376784) | 1.907113 / 4.565676 (-2.658564) | 0.077326 / 0.424275 (-0.346949) | 0.005182 / 0.007607 (-0.002425) | 0.332282 / 0.226044 (0.106237) | 3.312889 / 2.268929 (1.043960) | 1.853839 / 55.444624 (-53.590785) | 1.592013 / 6.876477 (-5.284464) | 1.620234 / 2.142072 (-0.521838) | 0.776894 / 4.805227 (-4.028333) | 0.132411 / 6.500664 (-6.368253) | 0.041430 / 0.075469 (-0.034039) |
Benchmark: benchmark_map_filter.json
| metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow |
|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 1.003468 / 1.841788 (-0.838320) | 12.472251 / 8.074308 (4.397943) | 10.603243 / 10.191392 (0.411851) | 0.132561 / 0.680424 (-0.547863) | 0.015790 / 0.534201 (-0.518411) | 0.306724 / 0.579283 (-0.272559) | 0.125812 / 0.434364 (-0.308552) | 0.343782 / 0.540337 (-0.196555) | 0.445915 / 1.386936 (-0.941021) |
