
Dataset Viewer issue for shamikbose89/lancaster_newsbooks

Open shamikbose opened this issue 2 years ago • 4 comments

Link

https://huggingface.co/datasets/shamikbose89/lancaster_newsbooks

Description

Status code:   400
Exception:     ValueError
Message:       Cannot seek streaming HTTP file

The dataset loading script works when I run it locally, and it also works when I load it from the Hub, but the viewer still doesn't load.

Owner

Yes

shamikbose avatar Jul 19 '22 20:07 shamikbose

It seems like the list of splits could not be obtained:

>>> from datasets import get_dataset_split_names
>>> get_dataset_split_names("shamikbose89/lancaster_newsbooks", "default")
Using custom data configuration default
Traceback (most recent call last):
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/inspect.py", line 354, in get_dataset_config_info
    for split_generator in builder._split_generators(
  File "/home/slesage/.cache/huggingface/modules/datasets_modules/datasets/shamikbose89--lancaster_newsbooks/2d1c63d269bf7b9342accce0a95960b1710ab4bc774248878bd80eb96c1afaf7/lancaster_newsbooks.py", line 73, in _split_generators
    data_dir = dl_manager.download_and_extract(_URL)
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 916, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 879, in extract
    urlpaths = map_nested(self._extract, path_or_paths, map_tuple=True)
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 348, in map_nested
    return function(data_struct)
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 884, in _extract
    protocol = _get_extraction_protocol(urlpath, use_auth_token=self.download_config.use_auth_token)
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 388, in _get_extraction_protocol
    return _get_extraction_protocol_with_magic_number(f)
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 354, in _get_extraction_protocol_with_magic_number
    f.seek(0)
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 684, in seek
    raise ValueError("Cannot seek streaming HTTP file")
ValueError: Cannot seek streaming HTTP file

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/inspect.py", line 404, in get_dataset_split_names
    info = get_dataset_config_info(
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/inspect.py", line 359, in get_dataset_config_info
    raise SplitsNotFoundError("The split names could not be parsed from the dataset config.") from err
datasets.inspect.SplitsNotFoundError: The split names could not be parsed from the dataset config.

ping @huggingface/datasets

severo avatar Jul 19 '22 20:07 severo

Oh, I had removed the 'split' key from kwargs. I put it back in, but I still get the same error.

shamikbose avatar Jul 19 '22 20:07 shamikbose

It looks like the data host doesn't support HTTP range requests, which are necessary to glob inside a ZIP archive in streaming mode. Can you try hosting the dataset elsewhere? Or download each file separately from https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2531?
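
For anyone who wants to check a host themselves, here is a minimal sketch of one way to probe for range support. It is not what `datasets` or `fsspec` do internally. The helper names `honours_range` and `probe_range_support` are made up for this example, and the check assumes that a server which supports ranges answers a `Range: bytes=0-3` request with `206 Partial Content` and a `Content-Range` header.

```python
import urllib.request


def honours_range(status, headers):
    """Decide from a response whether a 'Range: bytes=0-3' request was honoured.

    Hypothetical helper: a 206 status plus a Content-Range header is taken
    as evidence that the server served only the requested bytes.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    return status == 206 and "content-range" in lowered


def probe_range_support(url, timeout=10):
    """Request the first four bytes of `url` and report whether the range was honoured."""
    request = urllib.request.Request(url, headers={"Range": "bytes=0-3"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return honours_range(response.status, dict(response.headers.items()))
```

Calling `probe_range_support(url)` on the archive URL used by the loading script returns `False` when the server ignores the `Range` header and sends the whole file with a plain `200`. A host like that cannot be read at arbitrary offsets, which is what listing the members of a ZIP requires.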

lhoestq avatar Jul 20 '22 13:07 lhoestq

@lhoestq Thanks! That seems to have solved it. I can now get the splits with the get_dataset_split_names() function. The dataset viewer is still not loading properly, though. The new error is:

Status code:   400
Exception:     BadZipFile
Message:       File is not a zip file

PS. The dataset itself loads properly and can be accessed.
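
One way to narrow this down is to look at what the download URL actually returns. The sketch below assumes, without confirmation, that the viewer is receiving something other than a ZIP (for example an HTML landing page). `describe_payload` is a hypothetical helper, not part of `datasets`.

```python
import io
import zipfile


def describe_payload(data: bytes) -> str:
    """Roughly classify downloaded bytes (hypothetical diagnostic helper)."""
    # zipfile.is_zipfile accepts a file-like object and looks for the ZIP
    # end-of-central-directory record.
    if zipfile.is_zipfile(io.BytesIO(data)):
        return "zip"
    # A payload starting with '<' is probably HTML or XML rather than an archive.
    if data.lstrip()[:1] == b"<":
        return "markup (possibly an HTML page)"
    return "unknown"
```

Passing the raw bytes fetched from the archive URL to `describe_payload` shows whether the server is handing back an archive or some other document.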

shamikbose avatar Jul 20 '22 15:07 shamikbose