datasets Unpin hfh

Needed to use those in dataset-viewer:

dev version of hfh https://github.com/huggingface/dataset-viewer/pull/2781: don't spam the hub with /paths-info requests
dev version of datasets at https://github.com/huggingface/datasets/pull/6875: don't write too big logs in the viewer

close https://github.com/huggingface/datasets/issues/6863

May 06 '24 18:05 lhoestq

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

May 06 '24 18:05 HuggingFaceDocBuilderDev

transformers 4.40.2 was released yesterday, but I'm not sure if it contains the fix

May 07 '24 12:05 lhoestq

@lhoestq yes I knew transformers 4.40.2 was released yesterday, but I had checked that it does not contain the fix: only 2 bug fixes. That is why our CI continues failing in this PR. We will have to wait until the next minor version.

May 07 '24 13:05 albertvillanova

If we urgently need some dev feature for dataset-viewer, I would suggest pushing the feature (cherry-picked) to a dedicated branch with 2.19.1 as its starting point (without opening a PR), and installing datasets from that branch.

I have done so:

Created a branch from 2.19.1: https://github.com/huggingface/datasets/tree/datasets-2.19.1-hotfix
Cherry-picked the commit in this PR: https://github.com/huggingface/datasets/commit/3638183e2f7e0dce8924e46e7cc21bf6d5d7adfb
Opened a PR in dataset-viewer to update datasets to this revision: https://github.com/huggingface/dataset-viewer/pull/2783

May 07 '24 13:05 albertvillanova

hfh 0.23.1 and transformers 4.41.0 are out, let's unpin, no?

May 22 '24 16:05 lhoestq

I have re-run the CI to check that it is green beforehand.

May 23 '24 05:05 albertvillanova

The errors were coming from transformers having FutureWarning when loading models or tokenizers. I disabled the warnings for the transformers-related calls since they're not related to datasets
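As an illustration of this kind of suppression (a minimal sketch, not the actual change made in this PR), a call that emits a FutureWarning can be wrapped in `warnings.catch_warnings()` so the warning is ignored only for that call. The helper `call_ignoring_future_warnings` and the stand-in `noisy_loader` are hypothetical names, not part of datasets or transformers.

```python
import warnings


def call_ignoring_future_warnings(func, *args, **kwargs):
    """Call `func` while ignoring any FutureWarning emitted inside it.

    Hypothetical helper for illustration only.
    """
    with warnings.catch_warnings():
        # The filter is restored when the `with` block exits, so other
        # code still sees FutureWarning as configured globally.
        warnings.simplefilter("ignore", category=FutureWarning)
        return func(*args, **kwargs)


def noisy_loader(name):
    # Hypothetical stand-in for a model or tokenizer loading call
    # that emits a FutureWarning.
    warnings.warn("this argument is deprecated", FutureWarning)
    return f"loaded {name}"
```

The filter is scoped to the single call, so a test suite that turns warnings into errors elsewhere keeps doing so for datasets' own code.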

May 23 '24 13:05 lhoestq

I opened an issue in transformers:

https://github.com/huggingface/transformers/issues/31002

May 24 '24 07:05 albertvillanova

It's because the error from the FutureWarning was raised while running cache_file() from transformers, which wraps its code in a try/except and re-raises an OSError
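The mechanism described here can be sketched as follows, under stated assumptions: when warnings are promoted to errors, a FutureWarning raised inside a helper that catches exceptions and re-raises `OSError` surfaces as an `OSError`, with the warning only visible as its cause. The functions `download` and `fetch_cached` are hypothetical stand-ins, not the transformers implementation.

```python
import warnings


def download(name):
    # Hypothetical stand-in for a hub download that emits a FutureWarning.
    warnings.warn("this argument is deprecated", FutureWarning)
    return f"/cache/{name}"


def fetch_cached(name):
    # Hypothetical stand-in for a helper that wraps any failure in OSError.
    try:
        return download(name)
    except Exception as err:
        raise OSError(f"Can't load {name}") from err


def fetch_with_warnings_as_errors(name):
    # Mimics a test configuration that turns FutureWarning into an error.
    with warnings.catch_warnings():
        warnings.simplefilter("error", category=FutureWarning)
        return fetch_cached(name)
```

Because `FutureWarning` is a subclass of `Exception`, the broad `except` clause catches it once it is raised as an error, which is why the failure shows up as an `OSError` rather than as a warning.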

May 24 '24 09:05 lhoestq

Opened https://github.com/huggingface/transformers/pull/31007 to fix the FutureWarning in transformers. Sorry, thought it was fixed by https://github.com/huggingface/transformers/issues/30618 but clearly an oversight from my side.

Regarding the pytest config, yes, I remember adding it, and in general I still think it's a good idea to have it. I will be more careful next time to update transformers before huggingface_hub's release and not the other way around (it's the first time this has happened since I set this value :grimacing:). For a temporary fix in datasets, I would rather temporarily disable the filterwarnings in datasets than add filters in the test code.

May 24 '24 10:05 Wauplin

Alright, I disabled the errors on FutureWarning. Do you see anything else, @albertvillanova, or can we merge?

May 27 '24 09:05 lhoestq

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005165 / 0.011353 (-0.006188)	0.003991 / 0.011008 (-0.007017)	0.064029 / 0.038508 (0.025521)	0.031578 / 0.023109 (0.008468)	0.242646 / 0.275898 (-0.033252)	0.261834 / 0.323480 (-0.061646)	0.003032 / 0.007986 (-0.004953)	0.002659 / 0.004328 (-0.001670)	0.049868 / 0.004250 (0.045618)	0.047607 / 0.037052 (0.010555)	0.250537 / 0.258489 (-0.007952)	0.289460 / 0.293841 (-0.004381)	0.027225 / 0.128546 (-0.101321)	0.010496 / 0.075646 (-0.065151)	0.208455 / 0.419271 (-0.210816)	0.036813 / 0.043533 (-0.006720)	0.243361 / 0.255139 (-0.011778)	0.267477 / 0.283200 (-0.015723)	0.020402 / 0.141683 (-0.121281)	1.117118 / 1.452155 (-0.335037)	1.154868 / 1.492716 (-0.337849)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.096796 / 0.018006 (0.078790)	0.304588 / 0.000490 (0.304098)	0.000217 / 0.000200 (0.000017)	0.000043 / 0.000054 (-0.000011)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.019221 / 0.037411 (-0.018190)	0.062897 / 0.014526 (0.048371)	0.076446 / 0.176557 (-0.100111)	0.124476 / 0.737135 (-0.612659)	0.079921 / 0.296338 (-0.216418)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.284442 / 0.215209 (0.069233)	2.799419 / 2.077655 (0.721764)	1.468022 / 1.504120 (-0.036098)	1.354013 / 1.541195 (-0.187182)	1.379985 / 1.468490 (-0.088505)	0.561723 / 4.584777 (-4.023054)	2.408887 / 3.745712 (-1.336825)	2.712591 / 5.269862 (-2.557271)	1.803132 / 4.565676 (-2.762544)	0.063010 / 0.424275 (-0.361265)	0.005030 / 0.007607 (-0.002577)	0.339065 / 0.226044 (0.113021)	3.373667 / 2.268929 (1.104738)	1.861569 / 55.444624 (-53.583056)	1.551357 / 6.876477 (-5.325120)	1.701885 / 2.142072 (-0.440187)	0.645685 / 4.805227 (-4.159543)	0.117915 / 6.500664 (-6.382749)	0.042656 / 0.075469 (-0.032814)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.957397 / 1.841788 (-0.884391)	11.544300 / 8.074308 (3.469992)	9.761814 / 10.191392 (-0.429578)	0.134766 / 0.680424 (-0.545658)	0.015387 / 0.534201 (-0.518814)	0.285692 / 0.579283 (-0.293591)	0.269201 / 0.434364 (-0.165163)	0.328198 / 0.540337 (-0.212140)	0.422315 / 1.386936 (-0.964621)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005333 / 0.011353 (-0.006020)	0.003638 / 0.011008 (-0.007370)	0.050503 / 0.038508 (0.011994)	0.032240 / 0.023109 (0.009130)	0.267602 / 0.275898 (-0.008296)	0.293125 / 0.323480 (-0.030355)	0.004275 / 0.007986 (-0.003710)	0.002714 / 0.004328 (-0.001615)	0.049341 / 0.004250 (0.045090)	0.040364 / 0.037052 (0.003311)	0.281096 / 0.258489 (0.022607)	0.312615 / 0.293841 (0.018774)	0.029981 / 0.128546 (-0.098565)	0.010230 / 0.075646 (-0.065416)	0.059218 / 0.419271 (-0.360054)	0.033360 / 0.043533 (-0.010172)	0.269518 / 0.255139 (0.014379)	0.287559 / 0.283200 (0.004360)	0.018159 / 0.141683 (-0.123524)	1.107148 / 1.452155 (-0.345006)	1.170731 / 1.492716 (-0.321985)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.095942 / 0.018006 (0.077936)	0.304914 / 0.000490 (0.304425)	0.000227 / 0.000200 (0.000027)	0.000051 / 0.000054 (-0.000003)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.022609 / 0.037411 (-0.014803)	0.076455 / 0.014526 (0.061929)	0.088170 / 0.176557 (-0.088386)	0.128485 / 0.737135 (-0.608651)	0.092471 / 0.296338 (-0.203867)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.291471 / 0.215209 (0.076262)	2.822666 / 2.077655 (0.745012)	1.531679 / 1.504120 (0.027559)	1.405931 / 1.541195 (-0.135263)	1.418893 / 1.468490 (-0.049597)	0.576128 / 4.584777 (-4.008649)	0.969466 / 3.745712 (-2.776246)	2.831998 / 5.269862 (-2.437863)	1.788814 / 4.565676 (-2.776863)	0.064141 / 0.424275 (-0.360134)	0.005126 / 0.007607 (-0.002482)	0.341699 / 0.226044 (0.115654)	3.320551 / 2.268929 (1.051622)	1.903350 / 55.444624 (-53.541274)	1.611809 / 6.876477 (-5.264668)	1.729355 / 2.142072 (-0.412717)	0.654622 / 4.805227 (-4.150605)	0.118739 / 6.500664 (-6.381925)	0.041453 / 0.075469 (-0.034016)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.017635 / 1.841788 (-0.824153)	12.275948 / 8.074308 (4.201640)	10.416224 / 10.191392 (0.224832)	0.142288 / 0.680424 (-0.538135)	0.015591 / 0.534201 (-0.518610)	0.286515 / 0.579283 (-0.292768)	0.128661 / 0.434364 (-0.305703)	0.325728 / 0.540337 (-0.214609)	0.415827 / 1.386936 (-0.971109)

May 27 '24 10:05 github-actions[bot]
