BC5CDR links are not working anymore

Open drAbreu opened this issue 8 months ago • 0 comments

Describe the bug

The download link used by the bc5cdr dataset loader is no longer valid: the host www.biocreative.org can no longer be resolved.

Steps to reproduce the bug

from datasets import load_dataset
bc5_bigbio = load_dataset("bigbio/bc5cdr", "bc5cdr_source")
bc5_bigbio

Expected results

The dataset loaded, including the DatasetDict description from HuggingFace.

Actual results

Downloading and preparing dataset bc5cdr/bc5cdr_source to /root/.cache/huggingface/datasets/bigbio___bc5cdr/bc5cdr_source/1.5.16/68f03988d9e501c974d9f9987183bf06474858d1318ed0d4e51cfc4584f0f51f...
---------------------------------------------------------------------------
ConnectionError                           Traceback (most recent call last)
/tmp/ipykernel_10951/1398013518.py in <module>
----> 1 bc5_bigbio = load_dataset("bigbio/bc5cdr", "bc5cdr_source")
      2 bc5_bigbio

/opt/conda/lib/python3.8/site-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, num_proc, **config_kwargs)
   1780 
   1781     # Download and prepare data
-> 1782     builder_instance.download_and_prepare(
   1783         download_config=download_config,
   1784         download_mode=download_mode,

/opt/conda/lib/python3.8/site-packages/datasets/builder.py in download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
    870                         if num_proc is not None:
    871                             prepare_split_kwargs["num_proc"] = num_proc
--> 872                         self._download_and_prepare(
    873                             dl_manager=dl_manager,
    874                             verification_mode=verification_mode,

/opt/conda/lib/python3.8/site-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verification_mode, **prepare_splits_kwargs)
   1647 
   1648     def _download_and_prepare(self, dl_manager, verification_mode, **prepare_splits_kwargs):
-> 1649         super()._download_and_prepare(
   1650             dl_manager,
...
--> 532             raise ConnectionError(f"Couldn't reach {url} ({repr(head_error)})")
    533         elif response is not None:
    534             raise ConnectionError(f"Couldn't reach {url} (error {response.status_code})")

ConnectionError: Couldn't reach http://www.biocreative.org/media/store/files/2016/CDR_Data.zip (ConnectionError(MaxRetryError("HTTPConnectionPool(host='www.biocreative.org', port=80): Max retries exceeded with url: /media/store/files/2016/CDR_Data.zip (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3fd2741370>: Failed to establish a new connection: [Errno -2] Name or service not known'))")))
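
The `[Errno -2] Name or service not known` part of the error suggests that the hostname itself fails to resolve, rather than the file merely being missing on a reachable server. A minimal sketch for checking this independently of `datasets` is below. The helper name `host_resolves` and its `resolver` parameter are hypothetical, introduced here only for illustration; the sketch uses only the Python standard library.

```python
import socket
from urllib.parse import urlparse


def host_resolves(url, resolver=socket.getaddrinfo):
    """Return True if the hostname in `url` resolves via DNS, else False.

    `resolver` is a hypothetical hook (defaulting to socket.getaddrinfo)
    so the check can be exercised without network access.
    """
    parsed = urlparse(url)
    host = parsed.hostname
    if host is None:
        raise ValueError(f"No hostname found in URL: {url!r}")
    # Use the explicit port if present, otherwise the scheme's default.
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        resolver(host, port)
    except socket.gaierror:
        # Name resolution failed, e.g. "[Errno -2] Name or service not known".
        return False
    return True
```

If `host_resolves("http://www.biocreative.org/media/store/files/2016/CDR_Data.zip")` returns `False` in the same environment, that would point to a DNS problem (the domain being gone, or the container lacking a working resolver) rather than to a bug in the loader script.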

Environment info

Copy-and-paste the text below in your GitHub issue.

  • datasets version: 2.10.1
  • Platform: Linux-4.15.0-210-generic-x86_64-with-glibc2.10
  • Python version: 3.8.12
  • PyArrow version: 14.0.1
  • Pandas version: 1.3.5

drAbreu • Dec 04 '23 11:12