Label handling commit breaks the imdb finetuning script

Open prrao87 opened this issue 5 years ago • 0 comments

Thomas, thanks for sharing this code! I noticed that commit 8d9c2371fc37ba8958f174501a9afb91a7ef7a06 seems to have broken the default behavior of the classification finetuning scripts. In the previous version, the imdb and trec dictionaries each had a key called 'labels', but this line in finetuning_train.py still references the now-deleted key.

I updated the line to use DATASETS_LABELS_URL['imdb']['test'] directly, as intended, but it then seems that the S3 bucket doesn't have the IMDB test file.

See below:

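# cached_path is imported the same way the tutorial scripts import it
from pytorch_pretrained_bert import cached_path

file_path = "https://s3.amazonaws.com/datasets.huggingface.co/imdb/test.labels.txt"
label_file = cached_path(file_path)  # downloads the file (or reuses the cached copy)
with open(label_file, "r", encoding="utf-8") as f:
    all_lines = f.readlines()
# Show the first few lines of whatever was downloaded
print(all_lines[:5])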

Gives:

['<?xml version="1.0" encoding="UTF-8"?>\n', '<Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Key>imdb/test.labels.txt</Key><RequestId>3D9E7C511167A0FB</RequestId><HostId>RiidOcrHfFaqxW9tmUXRppE/G3lsYoCZcq+uaYDi2yPPoe8mv/Og6PMuUncwk+B53tGsvcCZMWk=</HostId></Error>']
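The output above is an S3 error document rather than label data, so the download itself "succeeded" and the problem only shows up when the file is read. A minimal sketch of how one might detect this case before parsing labels is below. The helper name `s3_error_code` is hypothetical and not part of the tutorial code; it only assumes that S3 errors arrive as an XML document with an `Error` root element and a `Code` child, as in the output shown here.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def s3_error_code(text: str) -> Optional[str]:
    """Return the S3 error code if `text` is an S3 <Error> document, else None.

    Hypothetical helper for illustration; not part of the tutorial repository.
    """
    try:
        # Parse from bytes so the XML encoding declaration is honored.
        root = ET.fromstring(text.strip().encode("utf-8"))
    except ET.ParseError:
        return None  # not well-formed XML, so presumably real label data
    if root.tag != "Error":
        return None
    return root.findtext("Code")


# Shortened version of the response shown in this issue.
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<Error><Code>NoSuchKey</Code>"
    "<Message>The specified key does not exist.</Message>"
    "<Key>imdb/test.labels.txt</Key></Error>"
)
print(s3_error_code(sample))        # NoSuchKey
print(s3_error_code("pos\nneg\n"))  # None
```

With the snippet from this issue, the check could be applied as `s3_error_code("".join(all_lines))` right after reading the file, raising an error if the result is not `None`.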

Does the test file for IMDB still exist with this name? This doesn't seem to be an issue with TREC.

prrao87, Jul 02 '19 10:07