
Broken Link to PubMed Abstracts dataset

Open sameemqureshi opened this issue 2 years ago • 5 comments

Describe the bug

The link provided for the dataset is broken: data_files = https://the-eye.eu/public/AI/pile_preliminary_components/PUBMED_title_abstracts_2019_baseline.jsonl.zst

Steps to reproduce the bug

Steps to reproduce:

  1. Head over to https://huggingface.co/learn/nlp-course/chapter5/4?fw=pt#big-data-datasets-to-the-rescue

  2. In the section "What is the Pile?", you can see a code snippet that contains the broken link.
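
The broken link can also be checked without opening the course page. Below is a minimal sketch using only the Python standard library; `link_status` and its `opener` parameter are hypothetical names chosen for illustration, and the result depends on what the server returns when the check is run.

```python
import urllib.error
import urllib.request

# URL quoted in the bug description above.
PUBMED_URL = (
    "https://the-eye.eu/public/AI/pile_preliminary_components/"
    "PUBMED_title_abstracts_2019_baseline.jsonl.zst"
)


def link_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code for url, or None if the host is unreachable."""
    # A HEAD request avoids downloading the (large) file body.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar.
        return None


# Usage (needs network access):
# print(link_status(PUBMED_URL))
```

A status outside the 2xx range, or `None`, would be consistent with the link being broken. Some servers reject HEAD requests, so an error status here is a hint rather than proof.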

Expected behavior

The link should redirect to the PubMed Abstracts dataset.

Environment info

.

sameemqureshi avatar Oct 01 '23 19:10 sameemqureshi

This has already been reported in the HF Course repo (https://github.com/huggingface/course/issues/623).

mariosasko avatar Oct 02 '23 15:10 mariosasko

@lhoestq @albertvillanova @lewtun I don't think we are allowed to host these data files on the Hub (due to DMCA), which means the only option is to use a different dataset in the course (and to re-record the video 🙂), no?

mariosasko avatar Oct 02 '23 15:10 mariosasko

Keeping the video is maybe fine: we can add a note on YouTube suggesting that viewers load a dataset with a different name. Maybe C4? And update the code snippets on the website?

lhoestq avatar Oct 02 '23 16:10 lhoestq

Maybe you want to try it with the PUBMED dataset that I reproduced based on the PubMed Abstract GitHub site and uploaded to Hugging Face:

from datasets import load_dataset
pubmed_dataset = load_dataset("hwang2006/PUBMED_title_abstracts_2020_baseline")
pubmed_dataset

# Downloading data: 100%
# 7.98G/7.98G [11:47<00:00, 9.68MB/s]
# Generating train split: 17722096/0 [00:36<00:00, 505376.37 examples/s]

# DatasetDict({
#     train: Dataset({
#         features: ['meta', 'text'],
#         num_rows: 17722096
#     })
# })
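
The download above is about 8 GB, so a streaming variant may be more convenient for a quick look at the data. This is a minimal sketch, assuming the dataset repository works with `load_dataset(..., streaming=True)`; the helper names `peek` and `stream_pubmed` are hypothetical and not part of the `datasets` library.

```python
from itertools import islice

# Dataset ID taken from the snippet above.
DATASET_ID = "hwang2006/PUBMED_title_abstracts_2020_baseline"


def peek(examples, n=3):
    """Return the first n items of any iterable as a list."""
    return list(islice(examples, n))


def stream_pubmed(dataset_id=DATASET_ID):
    """Open the train split lazily instead of downloading it in full."""
    # Imported here so that peek() stays usable without the datasets library.
    from datasets import load_dataset

    return load_dataset(dataset_id, split="train", streaming=True)


# Usage (needs network access and the datasets library):
# for example in peek(stream_pubmed(), n=2):
#     print(example["text"][:80])
```

With `streaming=True`, `load_dataset` returns an iterable dataset that fetches examples as they are read, so only the first few records are transferred.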

qualis2006 avatar Jan 08 '24 11:01 qualis2006

Kong Lingtao says thank you, thank you.

dosomethingbyme avatar Apr 28 '24 02:04 dosomethingbyme