Dataset Viewer issue for shamikbose89/lancaster_newsbooks
Link
https://huggingface.co/datasets/shamikbose89/lancaster_newsbooks
Description
Status code: 400
Exception: ValueError
Message: Cannot seek streaming HTTP file
I can use the dataset loading script locally, and it also runs when I load the one from the Hub, but the viewer still doesn't load.
Owner
Yes
It seems like the list of splits could not be obtained:
>>> from datasets import get_dataset_split_names
>>> get_dataset_split_names("shamikbose89/lancaster_newsbooks", "default")
Using custom data configuration default
Traceback (most recent call last):
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/inspect.py", line 354, in get_dataset_config_info
    for split_generator in builder._split_generators(
  File "/home/slesage/.cache/huggingface/modules/datasets_modules/datasets/shamikbose89--lancaster_newsbooks/2d1c63d269bf7b9342accce0a95960b1710ab4bc774248878bd80eb96c1afaf7/lancaster_newsbooks.py", line 73, in _split_generators
    data_dir = dl_manager.download_and_extract(_URL)
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 916, in download_and_extract
    return self.extract(self.download(url_or_urls))
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 879, in extract
    urlpaths = map_nested(self._extract, path_or_paths, map_tuple=True)
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 348, in map_nested
    return function(data_struct)
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 884, in _extract
    protocol = _get_extraction_protocol(urlpath, use_auth_token=self.download_config.use_auth_token)
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 388, in _get_extraction_protocol
    return _get_extraction_protocol_with_magic_number(f)
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/download/streaming_download_manager.py", line 354, in _get_extraction_protocol_with_magic_number
    f.seek(0)
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/fsspec/implementations/http.py", line 684, in seek
    raise ValueError("Cannot seek streaming HTTP file")
ValueError: Cannot seek streaming HTTP file

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/inspect.py", line 404, in get_dataset_split_names
    info = get_dataset_config_info(
  File "/home/slesage/hf/datasets-server/services/worker/.venv/lib/python3.9/site-packages/datasets/inspect.py", line 359, in get_dataset_config_info
    raise SplitsNotFoundError("The split names could not be parsed from the dataset config.") from err
datasets.inspect.SplitsNotFoundError: The split names could not be parsed from the dataset config.
ping @huggingface/datasets
Oh, I removed the 'split' key from kwargs. I put it back in, but there's still the same error.
It looks like the data host doesn't support HTTP range requests, which are necessary to glob inside a ZIP archive in streaming mode. Can you try hosting the dataset elsewhere, or downloading each file separately from https://ota.bodleian.ox.ac.uk/repository/xmlui/handle/20.500.12024/2531 instead?
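For anyone wanting to check range support on a host themselves, here is a minimal sketch using only the standard library. The function names `supports_range` and `probe_range_support` are made up for illustration and are not part of `datasets` or `fsspec`. It assumes that a server honoring ranges answers a `Range: bytes=0-0` request with `206 Partial Content` and a `Content-Range` header, and that anything else means ranges are ignored.

```python
import urllib.error
import urllib.request


def supports_range(status: int, headers: dict) -> bool:
    """Interpret the response to a `Range: bytes=0-0` request.

    Hypothetical helper: a 206 status plus a Content-Range header is
    treated as proof that the server honored the range.
    """
    lowered = {key.lower(): value for key, value in headers.items()}
    return status == 206 and "content-range" in lowered


def probe_range_support(url: str, timeout: float = 10.0) -> bool:
    """Request only the first byte of `url` and report whether the range was honored."""
    request = urllib.request.Request(url, headers={"Range": "bytes=0-0"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return supports_range(response.status, dict(response.headers.items()))
    except urllib.error.HTTPError as err:
        # Error responses (4xx/5xx) still carry a status code and headers.
        return supports_range(err.code, dict(err.headers.items()))
```

Calling `probe_range_support(url)` with the archive URL from the loading script should return `False` on a host that ignores the `Range` header and sends the full file with status 200.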
@lhoestq Thanks! That seems to have solved it. I can get the splits with the get_dataset_split_names() function. The dataset viewer is still not loading properly, though. The new error is:
Status code: 400
Exception: BadZipFile
Message: File is not a zip file
PS. The dataset itself loads properly and can be accessed.
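A quick way to narrow down a `BadZipFile` error is to test whether the downloaded bytes are a ZIP archive at all. The sketch below uses the standard library's `zipfile.is_zipfile`; the helpers `looks_like_zip` and `make_sample_zip` are hypothetical names for illustration. One possible cause, which is only an assumption and not confirmed in this thread, is that the URL now returns something other than an archive, such as an HTML page or a single XML file.

```python
import io
import zipfile


def looks_like_zip(payload: bytes) -> bool:
    """Return True if `payload` contains a readable ZIP end-of-central-directory record."""
    return zipfile.is_zipfile(io.BytesIO(payload))


def make_sample_zip() -> bytes:
    """Build a tiny in-memory ZIP archive, used here only to demonstrate the check."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("sample.xml", "<text/>")
    return buffer.getvalue()
```

Passing the raw bytes fetched from the URL in the loading script to `looks_like_zip` shows whether the extraction step is being handed a real archive.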