Unblock NumPy 2.0
Fixes https://github.com/huggingface/datasets/issues/6980
@albertvillanova Any chance we could get this in before the next release? Everything that depends on Hugging Face libraries has its NumPy upgrade blocked.
The incompatible libraries are:
- faiss-cpu 1.8.0.post1 requires numpy<2.0,>=1.0, but you have numpy 2.0.0 which is incompatible.
- tensorflow 2.16.2 requires numpy<2.0.0,>=1.23.5; python_version <= "3.11", but you have numpy 2.0.0 which is incompatible.
- transformers 4.42.3 requires numpy<2.0,>=1.17, but you have numpy 2.0.0 which is incompatible.
Why is it installing numpy 2 if the dependencies don't support it?
For me, I'm getting:
```
❯ uv pip install --system "datasets[tests] @ ."
Found existing alias for "uv pip install". You should use: "pipi"
Resolved 119 packages in 934ms
Built datasets @ file:///Users/neil/src/datasets
Prepared 1 package in 1.28s
Uninstalled 1 package in 10ms
Installed 2 packages in 17ms
- datasets==2.20.1.dev0 (from file:///Users/neil/src/datasets)
+ datasets==2.20.1.dev0 (from file:///Users/neil/src/datasets)
+ numpy==1.26.4
```
Which version of Python do you have?
3.12.4. I'll try 3.10 now.
Please note that I obtained the incompatible libraries above in my local environment by forcing the numpy upgrade.
In the Python 3.10 CI, the situation is different:
- for example, it installs an older version of tensorflow (2.14.0), which probably did not yet have the numpy<2.0 constraint. See the details: https://github.com/huggingface/datasets/actions/runs/9879100332/job/27306903343?pr=6991
```
> uv pip install --system "datasets[tests] @ ."
...
+ faiss-cpu==1.8.0
...
+ numpy==2.0.0
...
+ tensorflow==2.14.0
```
See, CI installs:
- faiss-cpu 1.8.0 instead of 1.8.0.post1
- tensorflow 2.14.0 instead of 2.16.2
- transformers 4.41.2 instead of 4.42.3
~~The main point is that we cannot support numpy 2.0 until tensorflow and faiss do.~~
Alternatively, we should ignore/select tests depending on the installed versions.
> Alternatively, we should ignore/select tests depending on the installed versions.

That works.
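A minimal sketch of what such version-gated skipping could look like (hypothetical helper names, not the PR's actual code):

```python
# Hypothetical sketch: skip tests whose optional dependencies still pin numpy<2.0.
import numpy as np
import pytest
from packaging.version import parse

NUMPY_2 = parse(np.__version__) >= parse("2.0.0")

require_numpy1 = pytest.mark.skipif(
    NUMPY_2,
    reason="tensorflow and faiss-cpu do not support NumPy 2.0 yet",
)

@require_numpy1
def test_tf_constant_roundtrip():
    import tensorflow as tf  # only imported when NumPy 1.x is installed

    assert tf.constant([1, 2]).numpy().sum() == 3
```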
Alternatively, you could depend on tensorflow >= 2.16.2 (etc.) for the tests?
Yes, I was thinking of a workaround solution.
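For reference, that alternative could look like this in the package's tests extra (a sketch using the version numbers from the conflicts above, not the PR's actual diff):

```python
# Hypothetical sketch: pin the tests extra to the latest releases named in this
# thread, so the resolver installs a consistent numpy<2.0 environment instead
# of pairing an old tensorflow with numpy 2.0.
TESTS_REQUIRE = [
    "faiss-cpu>=1.8.0.post1",
    "tensorflow>=2.16.2",
    "transformers>=4.42.3",
]
```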
The issue I see is that our CI would then not test numpy 2.0.

> The issue I see is that our CI would then not test numpy 2.0.

Right, that's the advantage of the test skipping you wanted; I see your point.
Thing is, it won't be long before tensorflow supports numpy 2.0, and then the situation resolves itself and your tests cover numpy 2.0. Do you really want to invest a lot of effort into testing numpy 2.0 for a few months' benefit?
Without testing NumPy 2.0, we do not know whether other parts of the code are broken.

> Without testing NumPy 2.0, we do not know whether other parts of the code are broken.

Yes, you're right. I understand your point, but you could say this about anything your test dependencies don't support.
I guess the solution is to write tests that don't depend on tensorflow, etc., but still use numpy. You could write some JAX tests, for example.
That said, blocking numpy 2 isn't a good solution in my opinion. These dependencies are extremely late in supporting Numpy 2. They were supposed to be testing against preview releases over three months ago. I don't think the world should have to wait for them.
> I guess the solution is to write tests that don't depend on tensorflow, etc., but still use numpy.

That is my point. What we cannot do is blindly claim support for NumPy 2.0 without knowing the consequences. We need to test it (see the sketch after this list):
- to know whether our core code works with it
- to know which optional libraries are incompatible
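As a rough illustration, a test along those lines might exercise the numpy formatting path directly (a hypothetical test, not part of this PR):

```python
# Hypothetical sketch: exercise datasets' numpy formatting without importing
# tensorflow, faiss, or other libraries that still pin numpy<2.0.
import numpy as np
from datasets import Dataset

def test_numpy_formatting_roundtrip():
    ds = Dataset.from_dict({"x": [[1.0, 2.0], [3.0, 4.0]]}).with_format("numpy")
    batch = ds[:2]["x"]
    assert isinstance(batch, np.ndarray)
    assert batch.shape == (2, 2)
```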
For example, while testing locally, I discovered that librosa is also incompatible with numpy 2.0, due to its dependency on soxr:
- https://github.com/dofuuz/python-soxr/issues/28
While testing locally, I also discovered that pytorch does not support NumPy 2.0 on Windows:
- https://github.com/pytorch/pytorch/issues/128860
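A platform-conditional skip along the same lines could look like this (hypothetical marker name, not necessarily the PR's actual code):

```python
# Hypothetical sketch: gate torch-dependent tests on Windows, where pytorch
# lacks NumPy 2.0 support per the issue linked above.
import sys

import numpy as np
import pytest
from packaging.version import parse

require_numpy1_on_windows = pytest.mark.skipif(
    parse(np.__version__) >= parse("2.0.0") and sys.platform == "win32",
    reason="pytorch does not support NumPy 2.0 on Windows",
)
```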
If you don't mind, I am adding NumPy 2.0 tests to your PR before merging.
Awesome, thank you! Please let me know if I need to do anything.
Now we test numpy 2.0 in the test_py310_numpy2 CI tests: https://github.com/huggingface/datasets/actions/runs/9907254874/job/27370545495?pr=6991
```
+ numpy==2.0.0
```
Benchmarks (PyArrow==8.0.0):
Benchmark: benchmark_array_xd.json
| metric | read_batch_formatted_as_numpy after write_array2d | read_batch_formatted_as_numpy after write_flattened_sequence | read_batch_formatted_as_numpy after write_nested_sequence | read_batch_unformated after write_array2d | read_batch_unformated after write_flattened_sequence | read_batch_unformated after write_nested_sequence | read_col_formatted_as_numpy after write_array2d | read_col_formatted_as_numpy after write_flattened_sequence | read_col_formatted_as_numpy after write_nested_sequence | read_col_unformated after write_array2d | read_col_unformated after write_flattened_sequence | read_col_unformated after write_nested_sequence | read_formatted_as_numpy after write_array2d | read_formatted_as_numpy after write_flattened_sequence | read_formatted_as_numpy after write_nested_sequence | read_unformated after write_array2d | read_unformated after write_flattened_sequence | read_unformated after write_nested_sequence | write_array2d | write_flattened_sequence | write_nested_sequence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.005709 / 0.011353 (-0.005643) | 0.003947 / 0.011008 (-0.007061) | 0.064407 / 0.038508 (0.025899) | 0.029903 / 0.023109 (0.006794) | 0.244838 / 0.275898 (-0.031060) | 0.268894 / 0.323480 (-0.054586) | 0.003200 / 0.007986 (-0.004786) | 0.002867 / 0.004328 (-0.001461) | 0.050016 / 0.004250 (0.045765) | 0.047682 / 0.037052 (0.010629) | 0.252186 / 0.258489 (-0.006303) | 0.292050 / 0.293841 (-0.001791) | 0.030277 / 0.128546 (-0.098270) | 0.012283 / 0.075646 (-0.063364) | 0.205875 / 0.419271 (-0.213397) | 0.037202 / 0.043533 (-0.006331) | 0.246045 / 0.255139 (-0.009094) | 0.272422 / 0.283200 (-0.010777) | 0.020572 / 0.141683 (-0.121111) | 1.114343 / 1.452155 (-0.337812) | 1.169909 / 1.492716 (-0.322808) |
Benchmark: benchmark_getitem_100B.json
| metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row |
|---|---|---|---|---|
| new / old (diff) | 0.096612 / 0.018006 (0.078605) | 0.303025 / 0.000490 (0.302535) | 0.000210 / 0.000200 (0.000010) | 0.000043 / 0.000054 (-0.000011) |
Benchmark: benchmark_indices_mapping.json
| metric | select | shard | shuffle | sort | train_test_split |
|---|---|---|---|---|---|
| new / old (diff) | 0.019292 / 0.037411 (-0.018119) | 0.062548 / 0.014526 (0.048023) | 0.076027 / 0.176557 (-0.100530) | 0.121752 / 0.737135 (-0.615383) | 0.076608 / 0.296338 (-0.219730) |
Benchmark: benchmark_iterating.json
| metric | read 5000 | read 50000 | read_batch 50000 10 | read_batch 50000 100 | read_batch 50000 1000 | read_formatted numpy 5000 | read_formatted pandas 5000 | read_formatted tensorflow 5000 | read_formatted torch 5000 | read_formatted_batch numpy 5000 10 | read_formatted_batch numpy 5000 1000 | shuffled read 5000 | shuffled read 50000 | shuffled read_batch 50000 10 | shuffled read_batch 50000 100 | shuffled read_batch 50000 1000 | shuffled read_formatted numpy 5000 | shuffled read_formatted_batch numpy 5000 10 | shuffled read_formatted_batch numpy 5000 1000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.283900 / 0.215209 (0.068691) | 2.829829 / 2.077655 (0.752174) | 1.428934 / 1.504120 (-0.075186) | 1.316796 / 1.541195 (-0.224399) | 1.330012 / 1.468490 (-0.138478) | 0.702245 / 4.584777 (-3.882532) | 2.380454 / 3.745712 (-1.365259) | 2.882881 / 5.269862 (-2.386980) | 1.920345 / 4.565676 (-2.645332) | 0.077860 / 0.424275 (-0.346415) | 0.005295 / 0.007607 (-0.002312) | 0.336968 / 0.226044 (0.110924) | 3.327808 / 2.268929 (1.058879) | 1.781958 / 55.444624 (-53.662666) | 1.489412 / 6.876477 (-5.387065) | 1.634829 / 2.142072 (-0.507243) | 0.787985 / 4.805227 (-4.017243) | 0.134397 / 6.500664 (-6.366267) | 0.042906 / 0.075469 (-0.032563) |
Benchmark: benchmark_map_filter.json
| metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow |
|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.967647 / 1.841788 (-0.874141) | 11.714541 / 8.074308 (3.640233) | 9.350228 / 10.191392 (-0.841164) | 0.142675 / 0.680424 (-0.537749) | 0.014609 / 0.534201 (-0.519592) | 0.301970 / 0.579283 (-0.277314) | 0.262350 / 0.434364 (-0.172014) | 0.342933 / 0.540337 (-0.197404) | 0.437321 / 1.386936 (-0.949615) |
Updated benchmarks:
Benchmark: benchmark_array_xd.json
| metric | read_batch_formatted_as_numpy after write_array2d | read_batch_formatted_as_numpy after write_flattened_sequence | read_batch_formatted_as_numpy after write_nested_sequence | read_batch_unformated after write_array2d | read_batch_unformated after write_flattened_sequence | read_batch_unformated after write_nested_sequence | read_col_formatted_as_numpy after write_array2d | read_col_formatted_as_numpy after write_flattened_sequence | read_col_formatted_as_numpy after write_nested_sequence | read_col_unformated after write_array2d | read_col_unformated after write_flattened_sequence | read_col_unformated after write_nested_sequence | read_formatted_as_numpy after write_array2d | read_formatted_as_numpy after write_flattened_sequence | read_formatted_as_numpy after write_nested_sequence | read_unformated after write_array2d | read_unformated after write_flattened_sequence | read_unformated after write_nested_sequence | write_array2d | write_flattened_sequence | write_nested_sequence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.005622 / 0.011353 (-0.005731) | 0.003958 / 0.011008 (-0.007050) | 0.050667 / 0.038508 (0.012159) | 0.032842 / 0.023109 (0.009733) | 0.252292 / 0.275898 (-0.023606) | 0.280602 / 0.323480 (-0.042878) | 0.004313 / 0.007986 (-0.003673) | 0.002870 / 0.004328 (-0.001458) | 0.049549 / 0.004250 (0.045299) | 0.040448 / 0.037052 (0.003396) | 0.270264 / 0.258489 (0.011775) | 0.302988 / 0.293841 (0.009147) | 0.030840 / 0.128546 (-0.097707) | 0.012131 / 0.075646 (-0.063515) | 0.060061 / 0.419271 (-0.359211) | 0.033025 / 0.043533 (-0.010507) | 0.251909 / 0.255139 (-0.003230) | 0.275511 / 0.283200 (-0.007689) | 0.018399 / 0.141683 (-0.123284) | 1.160744 / 1.452155 (-0.291411) | 1.188265 / 1.492716 (-0.304452) |
Benchmark: benchmark_getitem_100B.json
| metric | get_batch_of_1024_random_rows | get_batch_of_1024_rows | get_first_row | get_last_row |
|---|---|---|---|---|
| new / old (diff) | 0.097719 / 0.018006 (0.079712) | 0.304389 / 0.000490 (0.303899) | 0.000217 / 0.000200 (0.000017) | 0.000045 / 0.000054 (-0.000010) |
Benchmark: benchmark_indices_mapping.json
| metric | select | shard | shuffle | sort | train_test_split |
|---|---|---|---|---|---|
| new / old (diff) | 0.022964 / 0.037411 (-0.014447) | 0.076897 / 0.014526 (0.062372) | 0.088930 / 0.176557 (-0.087626) | 0.128926 / 0.737135 (-0.608209) | 0.091049 / 0.296338 (-0.205290) |
Benchmark: benchmark_iterating.json
| metric | read 5000 | read 50000 | read_batch 50000 10 | read_batch 50000 100 | read_batch 50000 1000 | read_formatted numpy 5000 | read_formatted pandas 5000 | read_formatted tensorflow 5000 | read_formatted torch 5000 | read_formatted_batch numpy 5000 10 | read_formatted_batch numpy 5000 1000 | shuffled read 5000 | shuffled read 50000 | shuffled read_batch 50000 10 | shuffled read_batch 50000 100 | shuffled read_batch 50000 1000 | shuffled read_formatted numpy 5000 | shuffled read_formatted_batch numpy 5000 10 | shuffled read_formatted_batch numpy 5000 1000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 0.285670 / 0.215209 (0.070461) | 2.806071 / 2.077655 (0.728416) | 1.527161 / 1.504120 (0.023041) | 1.410291 / 1.541195 (-0.130903) | 1.427071 / 1.468490 (-0.041419) | 0.705527 / 4.584777 (-3.879250) | 0.926915 / 3.745712 (-2.818797) | 2.893078 / 5.269862 (-2.376784) | 1.907113 / 4.565676 (-2.658564) | 0.077326 / 0.424275 (-0.346949) | 0.005182 / 0.007607 (-0.002425) | 0.332282 / 0.226044 (0.106237) | 3.312889 / 2.268929 (1.043960) | 1.853839 / 55.444624 (-53.590785) | 1.592013 / 6.876477 (-5.284464) | 1.620234 / 2.142072 (-0.521838) | 0.776894 / 4.805227 (-4.028333) | 0.132411 / 6.500664 (-6.368253) | 0.041430 / 0.075469 (-0.034039) |
Benchmark: benchmark_map_filter.json
| metric | filter | map fast-tokenizer batched | map identity | map identity batched | map no-op batched | map no-op batched numpy | map no-op batched pandas | map no-op batched pytorch | map no-op batched tensorflow |
|---|---|---|---|---|---|---|---|---|---|
| new / old (diff) | 1.003468 / 1.841788 (-0.838320) | 12.472251 / 8.074308 (4.397943) | 10.603243 / 10.191392 (0.411851) | 0.132561 / 0.680424 (-0.547863) | 0.015790 / 0.534201 (-0.518411) | 0.306724 / 0.579283 (-0.272559) | 0.125812 / 0.434364 (-0.308552) | 0.343782 / 0.540337 (-0.196555) | 0.445915 / 1.386936 (-0.941021) |
