
Cache directory seems to be messed up

Open severo opened this issue 4 years ago • 0 comments

I ran datasets-viewer locally and accessed the mrpc subset of the glue dataset.

Then I followed https://huggingface.co/docs/datasets/quicktour.html and loaded the same dataset and subset with:

>>> from datasets import load_dataset
>>> dataset = load_dataset('glue', 'mrpc', split='train')

Looking at the cache directory afterwards, the data seems to be messed up: the same files appear both directly under 1.0.0 and inside a hash-named subdirectory:

$ tree -L 3 ~/.cache/huggingface/datasets/glue/mrpc
/Users/slesage/.cache/huggingface/datasets/glue/mrpc
└── 1.0.0
    ├── LICENSE
    ├── dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad
    │   ├── LICENSE
    │   ├── dataset_info.json
    │   ├── glue-test.arrow
    │   ├── glue-train.arrow
    │   └── glue-validation.arrow
    ├── dataset_info.json
    ├── glue-test.arrow
    ├── glue-train.arrow
    └── glue-validation.arrow

2 directories, 10 files
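
For anyone who wants to check their own cache for the same duplication, here is a minimal sketch using only the standard library. The helper name `duplicated_cache_files` is hypothetical (it is not part of `datasets`), and the path in the usage part is the one from the tree output above; adjust it to your own cache location.

```python
from pathlib import Path


def duplicated_cache_files(version_dir):
    """Map each subdirectory name to the file names that also exist
    directly inside version_dir (hypothetical helper, not a datasets API)."""
    version_dir = Path(version_dir)
    # File names stored directly under the version directory (e.g. 1.0.0/)
    top_level = {p.name for p in version_dir.iterdir() if p.is_file()}
    duplicates = {}
    for sub in sorted(p for p in version_dir.iterdir() if p.is_dir()):
        # File names stored inside the (hash-named) subdirectory
        nested = {p.name for p in sub.iterdir() if p.is_file()}
        common = sorted(top_level & nested)
        if common:
            duplicates[sub.name] = common
    return duplicates


if __name__ == "__main__":
    # Path taken from the tree output above; skipped if it does not exist
    cache = Path.home() / ".cache/huggingface/datasets/glue/mrpc/1.0.0"
    if cache.is_dir():
        for subdir, names in duplicated_cache_files(cache).items():
            print(subdir, names)
```

On the layout shown above, this should report the five files (LICENSE, dataset_info.json and the three .arrow files) under the hash-named subdirectory.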

Also: I originally did it the other way around (first loading the subset as in the docs tutorial, then accessing it through a local instance of datasets-viewer), and I got an exception:

2021-07-21 11:34:31.631 Traceback (most recent call last):
  File "/Users/slesage/hf/datasets-viewer/.venv/lib/python3.8/site-packages/streamlit/script_runner.py", line 349, in _run_script
    exec(code, module.__dict__)
  File "/Users/slesage/hf/datasets-viewer/run.py", line 215, in <module>
    dts, fail = get(str(option), str(conf_option.name) if conf_option else None)
  File "/Users/slesage/hf/datasets-viewer/run.py", line 148, in get
    builder_instance = builder_cls(name=conf, cache_dir=path if path_to_datasets is not None else None)
  File "/Users/slesage/hf/datasets-viewer/.venv/lib/python3.8/site-packages/datasets/builder.py", line 1014, in __init__
    super(GeneratorBasedBuilder, self).__init__(*args, **kwargs)
  File "/Users/slesage/hf/datasets-viewer/.venv/lib/python3.8/site-packages/datasets/builder.py", line 269, in __init__
    self.info = DatasetInfo.from_directory(self._cache_dir)
  File "/Users/slesage/hf/datasets-viewer/.venv/lib/python3.8/site-packages/datasets/info.py", line 260, in from_directory
    with open(os.path.join(dataset_info_dir, config.DATASET_INFO_FILENAME), "r", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/slesage/.cache/huggingface/datasets/glue/mrpc/1.0.0/dataset_info.json'

because /Users/slesage/.cache/huggingface/datasets/glue/mrpc/1.0.0 existed but only contained dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad, while datasets-viewer expected to find dataset_info.json directly in it.
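
To see which of the two layouts a given cache directory has, something like the following sketch could help. It only looks at the file system; `find_dataset_info` is a hypothetical helper name, and the default file name is the one from the error message above.

```python
from pathlib import Path


def find_dataset_info(version_dir, filename="dataset_info.json"):
    """Return every existing copy of `filename` found either directly in
    version_dir or one level below it (hypothetical helper)."""
    version_dir = Path(version_dir)
    if not version_dir.is_dir():
        return []
    # Location that datasets-viewer looked at in the traceback above
    candidates = [version_dir / filename]
    # Locations inside hash-named subdirectories
    candidates += sorted(p / filename for p in version_dir.iterdir() if p.is_dir())
    return [p for p in candidates if p.is_file()]


if __name__ == "__main__":
    cache = Path.home() / ".cache/huggingface/datasets/glue/mrpc/1.0.0"
    for path in find_dataset_info(cache):
        print(path)
```

If the only path printed is the one inside the hash-named subdirectory, the directory is in the state that triggered the FileNotFoundError above.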

severo avatar Jul 21 '21 10:07 severo