
Datasets Preparation

Open zzzmm1 opened this issue 3 years ago • 4 comments

Thanks for sharing your excellent work! I am a newcomer to the field of multi-label text classification. I don't know where to download the train_data.json and test_data.json of Reuters-21578, nor the data2020.json and data2021.json of PubMed-BioASQ. These files are not included in the data downloaded from "https://www.kaggle.com/nltkdata/reuters". Could you please provide the data used in the paper, such as train_data.json, data_train.rand123 and labels_ref.rand123? I really want to follow your work as soon as possible. Thank you very much!

zzzmm1 · Nov 18 '21

Hi @zzzmm1, thanks for your interest. Both datasets come with their own licences, so we cannot provide them directly. Please check the Datasets section of the README: https://github.com/Roche/BalancedLossNLP#datasets

IMHO, it's helpful to preprocess and understand the dataset before running the pipeline. To help you get started, let's take Reuters-21578 as an example. Although you cannot get files named exactly train_data.json or test_data.json, you can find a file named cats.txt indicating the labels of all documents. With that and the instances in the training and test folders, you can try

converting each news document to a JSON list element with the properties: "labels" and "text"

so that the train_data.json would be like

[{'labels': ['cocoa'], 'text': 'BAHIA ...'}, ...]
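For reference, each non-empty line of cats.txt pairs a document path with its space-separated categories (as the snippet further down in this thread relies on); the file looks roughly like this, with the document IDs below being illustrative rather than copied from the corpus:

training/1 cocoa
test/14826 trade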

The files with the suffix .rand123 will be dumped by dataset_prep.ipynb; you can use other random seeds as well.
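Purely as a sketch of what a seeded split could look like (the real logic lives in dataset_prep.ipynb; the pickle format, the file names, and the use of Python's random module here are my assumptions, not the repository's actual code):

# hypothetical seeded shuffle; the .rand123 suffix presumably reflects seed 123
import json, pickle, random

data = json.load(open('data/train_data.json'))
random.seed(123)  # any other seed works too
random.shuffle(data)
pickle.dump(data, open('data/data_train.rand123', 'wb'))  # assumed dump format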

Please let me know if you have any questions.

blessu · Nov 19 '21

Thanks very much for your reply. I understand your preprocessing method, and I produced a copy of train_data.json and test_data.json, but I don't know whether they are the same as the ones you used. The devil is in the details. To stay consistent with your experimental design and reproduce your excellent results, could you please provide the preprocessing script that converts the downloaded raw data into the JSON files used in your experiments? That would be a great help.

zzzmm1 · Nov 19 '21

Thanks for your interest. The following snippet should work:

# after downloading the dataset from Kaggle and unzipping it
import json

training_data = []
test_data = []

# each non-empty line of cats.txt is "<split>/<doc-id> <label> [<label> ...]"
for line in open('data/reuters/reuters/cats.txt', encoding='ISO-8859-2').read().split('\n'):
    words = line.split(' ')
    if words and words[0]:
        # read the raw document text and attach the labels from cats.txt
        text = open(f'data/reuters/reuters/{words[0]}', encoding='ISO-8859-2').read()
        if words[0].startswith('training'):
            training_data.append({'labels': words[1:], 'text': text})
        elif words[0].startswith('test'):
            test_data.append({'labels': words[1:], 'text': text})

json.dump(training_data, open('data/training_data.json', 'w'))
json.dump(test_data, open('data/test_data.json', 'w'))
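As a quick sanity check, you can reload the dumps and eyeball a sample; the counts in the comment come from the standard ModApte split of Reuters-21578, so treat them as an approximate expectation rather than a guarantee of this exact snippet's output:

# reload the dumps and inspect a sample
train = json.load(open('data/training_data.json'))
test = json.load(open('data/test_data.json'))
print(len(train), len(test))  # the ModApte split has roughly 7769 training and 3019 test docs
print(train[0]['labels'], train[0]['text'][:40])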

blessu · Nov 29 '21

> Thanks very much for your reply. I understand your preprocessing method, and I obtained a copy of train_data.json and test_data.json, but are they the same as the ones you used? The devil is in the details. To stay consistent with your experimental design and reproduce your excellent results, could you provide the preprocessing script that converts the downloaded raw data into the JSON files? It would be a great help.

May I ask whether you managed to reproduce the algorithm from this paper?

sk0829 · Jun 17 '24