
Unblock NumPy 2.0

Open NeilGirdhar opened this issue 1 year ago • 1 comment

Fixes https://github.com/huggingface/datasets/issues/6980

NeilGirdhar avatar Jun 22 '24 09:06 NeilGirdhar

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@albertvillanova Any chance we could get this in before the next release? Everything depending on HuggingFace has their NumPy upgrade blocked.

NeilGirdhar avatar Jul 10 '24 17:07 NeilGirdhar

The incompatible libraries are:

  • faiss-cpu 1.8.0.post1 requires numpy<2.0,>=1.0, but you have numpy 2.0.0 which is incompatible.
  • tensorflow 2.16.2 requires numpy<2.0.0,>=1.23.5; python_version <= "3.11", but you have numpy 2.0.0 which is incompatible.
  • transformers 4.42.3 requires numpy<2.0,>=1.17, but you have numpy 2.0.0 which is incompatible.
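Conflicts like the ones above can also be enumerated programmatically. Below is a minimal stdlib sketch (the requirement matching is deliberately naive string matching, unlike a real PEP 508 parser) that lists installed distributions declaring a numpy-below-2 bound:

```python
from importlib import metadata


def dists_pinning_numpy_below_2() -> list[str]:
    """List installed distributions whose declared requirements pin numpy below 2.

    Parsing here is simplified string matching; real resolvers parse
    PEP 508 specifiers properly.
    """
    pinned = []
    for dist in metadata.distributions():
        for req in dist.requires or []:
            # normalize e.g. "numpy (<2.0,>=1.0)" to "numpy<2.0,>=1.0"
            compact = req.replace(" ", "").replace("(", "").replace(")", "")
            # only look at the specifier part, not environment markers
            if compact.startswith("numpy") and "<2" in compact.split(";")[0]:
                pinned.append(dist.metadata["Name"])
                break
    return sorted(set(pinned))


print(dists_pinning_numpy_below_2())
```

Running this in the failing environment should surface the same three packages the resolver complained about.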

albertvillanova avatar Jul 11 '24 10:07 albertvillanova

Why is it installing numpy 2 if the dependencies don't support it?

NeilGirdhar avatar Jul 11 '24 10:07 NeilGirdhar

For me, I'm getting:

❯ uv pip install --system "datasets[tests] @ ."
Resolved 119 packages in 934ms
   Built datasets @ file:///Users/neil/src/datasets
Prepared 1 package in 1.28s
Uninstalled 1 package in 10ms
Installed 2 packages in 17ms
 - datasets==2.20.1.dev0 (from file:///Users/neil/src/datasets)
 + datasets==2.20.1.dev0 (from file:///Users/neil/src/datasets)
 + numpy==1.26.4

NeilGirdhar avatar Jul 11 '24 10:07 NeilGirdhar

Which version of Python do you have?

albertvillanova avatar Jul 11 '24 11:07 albertvillanova

3.12.4. I'll try on 3.10 now.

NeilGirdhar avatar Jul 11 '24 11:07 NeilGirdhar

Please note that I obtained the previous list of incompatible libraries in my local environment by forcing the numpy update.

albertvillanova avatar Jul 11 '24 11:07 albertvillanova

In the Python 3.10 CI, the situation is different:

  • for example, it installs an older version of tensorflow (2.14.0), which probably did not yet declare the numpy constraint. See the details: https://github.com/huggingface/datasets/actions/runs/9879100332/job/27306903343?pr=6991
> uv pip install --system "datasets[tests] @ ."
...
 + faiss-cpu==1.8.0
...
 + numpy==2.0.0
...
 + tensorflow==2.14.0

As you can see, the CI installs:

  • faiss-cpu 1.8.0 instead of 1.8.0.post1
  • tensorflow 2.14.0 instead of 2.16.2
  • transformers 4.41.2 instead of 4.42.3

albertvillanova avatar Jul 11 '24 11:07 albertvillanova

~~The main point is that we cannot support numpy 2.0 until tensorflow and faiss do.~~

Alternatively, we should ignore/select tests depending on the installed versions.

albertvillanova avatar Jul 11 '24 11:07 albertvillanova

> Alternatively, we should ignore/select tests depending on the installed versions.

That works.

Alternatively, you could depend on tensorflow >= 2.16.2 (etc.) for the tests?
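As an illustration, such a test-extras pin might look like the fragment below; the names and bounds are hypothetical, not the actual pins in datasets' setup.py:

```python
# Hypothetical fragment of a setup.py extras definition, pinning the test
# dependencies to the versions a fresh local resolve would pick. The real
# datasets setup.py pins differ and should be taken from the repo.
TESTS_REQUIRE = [
    "numpy>=1.17",
    "tensorflow>=2.16.2",
    "faiss-cpu>=1.8.0.post1",
]
```

With lower bounds like these, the CI resolver would hit the same version conflicts as a fresh local install, instead of silently backtracking to older releases.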

NeilGirdhar avatar Jul 11 '24 11:07 NeilGirdhar

Yes, I was thinking of a workaround solution.

The issue I see is that our CI would then not test numpy 2.0 at all.

albertvillanova avatar Jul 11 '24 11:07 albertvillanova

> The issue I see is that our CI would then not test numpy 2.0 at all.

Right, that's the advantage of the test skipping you wanted; I see your point.

Thing is, it won't be long before tensorflow supports numpy 2.0, at which point the situation resolves itself and your tests will cover numpy 2.0 anyway. Do you really want to invest a lot of effort into testing numpy 2.0 for a few months' benefit?

NeilGirdhar avatar Jul 11 '24 11:07 NeilGirdhar

Without testing numpy 2.0, we do not know whether other parts of the code are broken.

albertvillanova avatar Jul 11 '24 12:07 albertvillanova

> Without testing numpy 2.0, we do not know whether other parts of the code are broken.

Yes, you're right. I understand your point, but you could say this for anything your test dependencies don't support.

I guess the solution is to write tests that still use numpy but don't depend on tensorflow, etc. You could write some JAX tests, for example.
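A framework-free numpy test of that kind might look like this sketch (hypothetical test body, not taken from the datasets test suite):

```python
import numpy as np


def test_numpy_roundtrip() -> None:
    # Round-trip an array through raw bytes, the way a formatting layer
    # might serialize and restore numpy data.
    arr = np.arange(6, dtype=np.float32).reshape(2, 3)
    restored = np.frombuffer(arr.tobytes(), dtype=np.float32).reshape(arr.shape)
    assert np.array_equal(arr, restored)


test_numpy_roundtrip()
```

A test like this runs identically under numpy 1.x and 2.0, so the same suite exercises both without any framework dependency.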

That said, blocking numpy 2 isn't a good solution in my opinion. These dependencies are extremely late in supporting Numpy 2. They were supposed to be testing against preview releases over three months ago. I don't think the world should have to wait for them.

NeilGirdhar avatar Jul 11 '24 12:07 NeilGirdhar

> I guess the solution is to write tests that don't depend on tensorflow, etc., but still use numpy.

That is my point. What we cannot do is blindly support numpy 2.0 without knowing the consequences. We need to test it:

  • to know if our core code works with it
  • to know what optional libraries are incompatible

For example, while testing locally, I discovered that librosa is also incompatible with numpy 2.0, due to its dependency on soxr:

  • https://github.com/dofuuz/python-soxr/issues/28

albertvillanova avatar Jul 12 '24 05:07 albertvillanova

While testing locally, I have also discovered that pytorch does not support numpy 2.0 on Windows platforms:

  • https://github.com/pytorch/pytorch/issues/128860

albertvillanova avatar Jul 12 '24 06:07 albertvillanova

If you don't mind, I am adding numpy 2.0 tests to your PR before merging it.

albertvillanova avatar Jul 12 '24 10:07 albertvillanova

Awesome, thank you! Please let me know if I need to do anything.

NeilGirdhar avatar Jul 12 '24 10:07 NeilGirdhar

Now we test numpy 2.0 in the test_py310_numpy2 CI tests: https://github.com/huggingface/datasets/actions/runs/9907254874/job/27370545495?pr=6991

 + numpy==2.0.0

albertvillanova avatar Jul 12 '24 11:07 albertvillanova

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.005709 / 0.011353 (-0.005643) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003947 / 0.011008 (-0.007061) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.064407 / 0.038508 (0.025899) |
| read_batch_unformated after write_array2d | 0.029903 / 0.023109 (0.006794) |
| read_batch_unformated after write_flattened_sequence | 0.244838 / 0.275898 (-0.031060) |
| read_batch_unformated after write_nested_sequence | 0.268894 / 0.323480 (-0.054586) |
| read_col_formatted_as_numpy after write_array2d | 0.003200 / 0.007986 (-0.004786) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.002867 / 0.004328 (-0.001461) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.050016 / 0.004250 (0.045765) |
| read_col_unformated after write_array2d | 0.047682 / 0.037052 (0.010629) |
| read_col_unformated after write_flattened_sequence | 0.252186 / 0.258489 (-0.006303) |
| read_col_unformated after write_nested_sequence | 0.292050 / 0.293841 (-0.001791) |
| read_formatted_as_numpy after write_array2d | 0.030277 / 0.128546 (-0.098270) |
| read_formatted_as_numpy after write_flattened_sequence | 0.012283 / 0.075646 (-0.063364) |
| read_formatted_as_numpy after write_nested_sequence | 0.205875 / 0.419271 (-0.213397) |
| read_unformated after write_array2d | 0.037202 / 0.043533 (-0.006331) |
| read_unformated after write_flattened_sequence | 0.246045 / 0.255139 (-0.009094) |
| read_unformated after write_nested_sequence | 0.272422 / 0.283200 (-0.010777) |
| write_array2d | 0.020572 / 0.141683 (-0.121111) |
| write_flattened_sequence | 1.114343 / 1.452155 (-0.337812) |
| write_nested_sequence | 1.169909 / 1.492716 (-0.322808) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.096612 / 0.018006 (0.078605) |
| get_batch_of_1024_rows | 0.303025 / 0.000490 (0.302535) |
| get_first_row | 0.000210 / 0.000200 (0.000010) |
| get_last_row | 0.000043 / 0.000054 (-0.000011) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.019292 / 0.037411 (-0.018119) |
| shard | 0.062548 / 0.014526 (0.048023) |
| shuffle | 0.076027 / 0.176557 (-0.100530) |
| sort | 0.121752 / 0.737135 (-0.615383) |
| train_test_split | 0.076608 / 0.296338 (-0.219730) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.283900 / 0.215209 (0.068691) |
| read 50000 | 2.829829 / 2.077655 (0.752174) |
| read_batch 50000 10 | 1.428934 / 1.504120 (-0.075186) |
| read_batch 50000 100 | 1.316796 / 1.541195 (-0.224399) |
| read_batch 50000 1000 | 1.330012 / 1.468490 (-0.138478) |
| read_formatted numpy 5000 | 0.702245 / 4.584777 (-3.882532) |
| read_formatted pandas 5000 | 2.380454 / 3.745712 (-1.365259) |
| read_formatted tensorflow 5000 | 2.882881 / 5.269862 (-2.386980) |
| read_formatted torch 5000 | 1.920345 / 4.565676 (-2.645332) |
| read_formatted_batch numpy 5000 10 | 0.077860 / 0.424275 (-0.346415) |
| read_formatted_batch numpy 5000 1000 | 0.005295 / 0.007607 (-0.002312) |
| shuffled read 5000 | 0.336968 / 0.226044 (0.110924) |
| shuffled read 50000 | 3.327808 / 2.268929 (1.058879) |
| shuffled read_batch 50000 10 | 1.781958 / 55.444624 (-53.662666) |
| shuffled read_batch 50000 100 | 1.489412 / 6.876477 (-5.387065) |
| shuffled read_batch 50000 1000 | 1.634829 / 2.142072 (-0.507243) |
| shuffled read_formatted numpy 5000 | 0.787985 / 4.805227 (-4.017243) |
| shuffled read_formatted_batch numpy 5000 10 | 0.134397 / 6.500664 (-6.366267) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.042906 / 0.075469 (-0.032563) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 0.967647 / 1.841788 (-0.874141) |
| map fast-tokenizer batched | 11.714541 / 8.074308 (3.640233) |
| map identity | 9.350228 / 10.191392 (-0.841164) |
| map identity batched | 0.142675 / 0.680424 (-0.537749) |
| map no-op batched | 0.014609 / 0.534201 (-0.519592) |
| map no-op batched numpy | 0.301970 / 0.579283 (-0.277314) |
| map no-op batched pandas | 0.262350 / 0.434364 (-0.172014) |
| map no-op batched pytorch | 0.342933 / 0.540337 (-0.197404) |
| map no-op batched tensorflow | 0.437321 / 1.386936 (-0.949615) |
PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.005622 / 0.011353 (-0.005731) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003958 / 0.011008 (-0.007050) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.050667 / 0.038508 (0.012159) |
| read_batch_unformated after write_array2d | 0.032842 / 0.023109 (0.009733) |
| read_batch_unformated after write_flattened_sequence | 0.252292 / 0.275898 (-0.023606) |
| read_batch_unformated after write_nested_sequence | 0.280602 / 0.323480 (-0.042878) |
| read_col_formatted_as_numpy after write_array2d | 0.004313 / 0.007986 (-0.003673) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.002870 / 0.004328 (-0.001458) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.049549 / 0.004250 (0.045299) |
| read_col_unformated after write_array2d | 0.040448 / 0.037052 (0.003396) |
| read_col_unformated after write_flattened_sequence | 0.270264 / 0.258489 (0.011775) |
| read_col_unformated after write_nested_sequence | 0.302988 / 0.293841 (0.009147) |
| read_formatted_as_numpy after write_array2d | 0.030840 / 0.128546 (-0.097707) |
| read_formatted_as_numpy after write_flattened_sequence | 0.012131 / 0.075646 (-0.063515) |
| read_formatted_as_numpy after write_nested_sequence | 0.060061 / 0.419271 (-0.359211) |
| read_unformated after write_array2d | 0.033025 / 0.043533 (-0.010507) |
| read_unformated after write_flattened_sequence | 0.251909 / 0.255139 (-0.003230) |
| read_unformated after write_nested_sequence | 0.275511 / 0.283200 (-0.007689) |
| write_array2d | 0.018399 / 0.141683 (-0.123284) |
| write_flattened_sequence | 1.160744 / 1.452155 (-0.291411) |
| write_nested_sequence | 1.188265 / 1.492716 (-0.304452) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.097719 / 0.018006 (0.079712) |
| get_batch_of_1024_rows | 0.304389 / 0.000490 (0.303899) |
| get_first_row | 0.000217 / 0.000200 (0.000017) |
| get_last_row | 0.000045 / 0.000054 (-0.000010) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.022964 / 0.037411 (-0.014447) |
| shard | 0.076897 / 0.014526 (0.062372) |
| shuffle | 0.088930 / 0.176557 (-0.087626) |
| sort | 0.128926 / 0.737135 (-0.608209) |
| train_test_split | 0.091049 / 0.296338 (-0.205290) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.285670 / 0.215209 (0.070461) |
| read 50000 | 2.806071 / 2.077655 (0.728416) |
| read_batch 50000 10 | 1.527161 / 1.504120 (0.023041) |
| read_batch 50000 100 | 1.410291 / 1.541195 (-0.130903) |
| read_batch 50000 1000 | 1.427071 / 1.468490 (-0.041419) |
| read_formatted numpy 5000 | 0.705527 / 4.584777 (-3.879250) |
| read_formatted pandas 5000 | 0.926915 / 3.745712 (-2.818797) |
| read_formatted tensorflow 5000 | 2.893078 / 5.269862 (-2.376784) |
| read_formatted torch 5000 | 1.907113 / 4.565676 (-2.658564) |
| read_formatted_batch numpy 5000 10 | 0.077326 / 0.424275 (-0.346949) |
| read_formatted_batch numpy 5000 1000 | 0.005182 / 0.007607 (-0.002425) |
| shuffled read 5000 | 0.332282 / 0.226044 (0.106237) |
| shuffled read 50000 | 3.312889 / 2.268929 (1.043960) |
| shuffled read_batch 50000 10 | 1.853839 / 55.444624 (-53.590785) |
| shuffled read_batch 50000 100 | 1.592013 / 6.876477 (-5.284464) |
| shuffled read_batch 50000 1000 | 1.620234 / 2.142072 (-0.521838) |
| shuffled read_formatted numpy 5000 | 0.776894 / 4.805227 (-4.028333) |
| shuffled read_formatted_batch numpy 5000 10 | 0.132411 / 6.500664 (-6.368253) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.041430 / 0.075469 (-0.034039) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.003468 / 1.841788 (-0.838320) |
| map fast-tokenizer batched | 12.472251 / 8.074308 (4.397943) |
| map identity | 10.603243 / 10.191392 (0.411851) |
| map identity batched | 0.132561 / 0.680424 (-0.547863) |
| map no-op batched | 0.015790 / 0.534201 (-0.518411) |
| map no-op batched numpy | 0.306724 / 0.579283 (-0.272559) |
| map no-op batched pandas | 0.125812 / 0.434364 (-0.308552) |
| map no-op batched pytorch | 0.343782 / 0.540337 (-0.196555) |
| map no-op batched tensorflow | 0.445915 / 1.386936 (-0.941021) |

github-actions[bot] avatar Jul 12 '24 12:07 github-actions[bot]