Cannot build Hugging Face datasets
Short description
$ tfds build huggingface:mnist/mnist
FileNotFoundError: Request failed for https://raw.githubusercontent.com/huggingface/datasets/master/datasets/mnist/dataset_infos.json
Error: 404
Reason: b'404: Not Found'
It seems the index (https://github.com/tensorflow/datasets/blob/751053fdb0f39cfc0d30797d3119b81306b91d5a/tensorflow_datasets/core/community/cache.py#L22) is out of date and hasn't been updated to use the hub: https://github.com/huggingface/datasets/pull/4059.
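The 404 is consistent with a stale index: TFDS still constructs the pre-hub raw.githubusercontent.com path, which stopped existing once the dataset scripts moved to the Hub. A minimal sketch of the failing URL construction (the template string below is a hypothetical reconstruction for illustration, not the actual constant from cache.py):

```python
# Hypothetical reconstruction of the stale path TFDS still requests;
# the real logic lives in tensorflow_datasets (see the linked cache.py).
STALE_TEMPLATE = (
    "https://raw.githubusercontent.com/huggingface/datasets/master/"
    "datasets/{name}/dataset_infos.json"
)

def stale_url(name: str) -> str:
    """Build the pre-hub URL that now returns 404 for every dataset."""
    return STALE_TEMPLATE.format(name=name)

print(stale_url("mnist"))
# -> https://raw.githubusercontent.com/huggingface/datasets/master/datasets/mnist/dataset_infos.json
```

The URL it produces for mnist is exactly the one in the traceback below, so updating the cached index to point at the Hub would presumably fix the lookup.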
Environment information
- Operating System: macOS
- Python version: 3.11
- tensorflow-datasets/tfds-nightly version: tfds-nightly
- tensorflow/tf-nightly version: 2.16.1
- Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)? Yes.
Reproduction instructions
tfds build huggingface:mnist/mnist
Link to logs
INFO[config.py]: Loading namespace config from /usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/community-datasets.toml
Traceback (most recent call last):
File "/usr/local/google/home/phillypham/venv/grain/bin/tfds", line 8, in <module>
sys.exit(launch_cli())
^^^^^^^^^^^^
File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/scripts/cli/main.py", line 105, in launch_cli
app.run(main, flags_parser=_parse_flags)
File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/absl/app.py", line 308, in run
_run_main(main, args)
File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
sys.exit(main(argv))
^^^^^^^^^^
File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/scripts/cli/main.py", line 100, in main
args.subparser_fn(args)
File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/scripts/cli/build.py", line 302, in _build_datasets
builders_cls_and_kwargs = [
^
File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/scripts/cli/build.py", line 303, in <listcomp>
_get_builder_cls_and_kwargs(dataset, has_imports=bool(args.imports))
File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/scripts/cli/build.py", line 420, in _get_builder_cls_and_kwargs
builder_cls = tfds.builder_cls(str(name))
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/core/load.py", line 114, in builder_cls
return community.community_register().builder_cls(ds_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/core/community/registry.py", line 259, in builder_cls
return registers[0].builder_cls(name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/core/community/register_package.py", line 249, in builder_cls
installed_dataset = _download_or_reuse_cache(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/core/community/register_package.py", line 402, in _download_or_reuse_cache
installed_package = _download_and_cache(package)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/core/community/register_package.py", line 449, in _download_and_cache
dataset_sources_lib.download_from_source(
File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/core/community/dataset_sources.py", line 80, in download_from_source
path.copy(dst / path.name)
File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/core/github_api/github_path.py", line 338, in copy
dst.write_bytes(self.read_bytes())
^^^^^^^^^^^^^^^^^
File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/core/github_api/github_path.py", line 311, in read_bytes
return get_content(url)
^^^^^^^^^^^^^^^^
File "/usr/local/google/home/phillypham/venv/grain/lib/python3.11/site-packages/tensorflow_datasets/core/github_api/github_path.py", line 44, in get_content
raise FileNotFoundError(
FileNotFoundError: Request failed for https://raw.githubusercontent.com/huggingface/datasets/master/datasets/mnist/dataset_infos.json
Error: 404
Reason: b'404: Not Found'
Expected behavior
The build should succeed and proceed to call download_and_prepare.
Additional context
python -c "import tensorflow_datasets as tfds; tfds.builder('huggingface:mnist/mnist')"
works.
Have you tried replacing / with __?
If you're trying to work with mnist, you can pull it from the TensorFlow Datasets catalog at https://www.tensorflow.org/datasets/catalog/overview:
python -c "import tensorflow_datasets as tfds; tfds.builder('mnist')"
works as well.
If you do need to pull a dataset from Hugging Face, consider using tfds.load() and replacing / with __.
Hope this helps!