One-hot label percentage error in Wikim dataset

Open cairoHy opened this issue 4 years ago • 0 comments

Hi,

After downloading the corpus and preprocessing using transform.py, we find that 92.9% of wikim test samples have one-hot labels. The statistics are different from those shown in the paper.

Our results: image

Statistics in the paper: image

We preprocess the wikim dataset following README.md. We calculate the statistics by adding the following code snippet after line 77 in task.py:

# Full test set: number of active labels per sample (x[-1] is the label vector).
label_k = [x[-1].sum() for x in self.full_test_set]
label_one_hot = [x for x in label_k if x == 1]
label_multi_hot = [x for x in label_k if x != 1]
logger.info('{}/{} one hot, {}/{} multi hot.'.format(len(label_one_hot), len(label_k), len(label_multi_hot), len(label_k)))
# Same statistics for self.test_set.
label_k = [x[-1].sum() for x in self.test_set]
label_one_hot = [x for x in label_k if x == 1]
label_multi_hot = [x for x in label_k if x != 1]
logger.info('test set: {}/{} one hot, {}/{} multi hot.'.format(len(label_one_hot), len(label_k), len(label_multi_hot), len(label_k)))
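
For anyone who wants to check the counting logic outside of task.py, here is a minimal standalone sketch of the same computation. The function name `one_hot_stats` and the toy label vectors are hypothetical and are not part of the NFETC code; the sketch only assumes that each sample's labels form a 0/1 vector, as the snippet above implies.

```python
def one_hot_stats(label_vectors):
    """Count samples with exactly one active label.

    label_vectors: iterable of 0/1 sequences (lists or arrays), one per sample.
    Returns (n_one_hot, n_multi_hot, pct_one_hot). As in the snippet above,
    every sample whose label count is not exactly 1 is counted as "multi hot",
    which would also include samples with zero labels.
    """
    counts = [int(sum(v)) for v in label_vectors]
    total = len(counts)
    n_one_hot = sum(1 for c in counts if c == 1)
    n_multi_hot = total - n_one_hot
    pct_one_hot = 100.0 * n_one_hot / total if total else 0.0
    return n_one_hot, n_multi_hot, pct_one_hot


# Hypothetical toy data, not taken from the wikim corpus.
toy_labels = [
    [1, 0, 0, 0],  # one label
    [0, 1, 1, 0],  # two labels
    [0, 0, 0, 1],  # one label
    [1, 1, 0, 1],  # three labels
]

n_one, n_multi, pct = one_hot_stats(toy_labels)
print('{}/{} one hot ({:.1f}%), {}/{} multi hot.'.format(
    n_one, len(toy_labels), pct, n_multi, len(toy_labels)))
```

Reporting the percentage directly makes it easier to compare against the figure in the paper than comparing raw counts.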

cairoHy, Apr 30 '20 02:04