JNLPBA dataset

Open stefan-it opened this issue 5 years ago • 3 comments

Hi,

thanks for releasing the SciBERT model and datasets :heart:

I'm currently adding an importer for the NER data to the flair library.

I checked the number of imported sentences across all dataset splits and found a mismatch of 2,404 sentences against the total reported in Table 2 of the paper (24,806). I then looked at the JNLPBA dataset, and it seems that all -DOCSTART- O lines were counted as sentences too, so the reported total appears to include document markers as well as actual sentences.
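A minimal sketch of how the two counts can be separated, assuming the files are CoNLL-style (one token per line, sentences separated by blank lines, and each -DOCSTART- line sitting in its own block). The function name `count_sentences` and the sample lines are hypothetical and only illustrate the idea; this is not flair's or SciBERT's actual loader.

```python
def count_sentences(lines):
    """Return (sentences, docstart_markers) for CoNLL-style lines.

    Assumption: sentences are separated by blank lines and document
    boundaries are marked by lines starting with "-DOCSTART-".
    """
    sentences = 0
    docstarts = 0
    in_sentence = False
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current block.
            in_sentence = False
            continue
        if line.startswith("-DOCSTART-"):
            # Document marker: count it separately, not as a sentence.
            docstarts += 1
            in_sentence = False
            continue
        if not in_sentence:
            # First token line of a new sentence.
            sentences += 1
            in_sentence = True
    return sentences, docstarts


# Hypothetical sample in the assumed format (tags are illustrative).
sample = [
    "-DOCSTART- O",
    "",
    "IL-2 B-DNA",
    "gene I-DNA",
    "",
    "-DOCSTART- O",
    "",
    "NF-kappa B-protein",
    "",
]
sentences, docstarts = count_sentences(sample)
```

To run it on a real split, pass an open file handle, e.g. `count_sentences(open("train.txt", encoding="utf-8"))`. If the marker lines are counted as sentences, the total would be `sentences + docstarts`, which is the kind of gap described above.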

The numbers of training and development sentences also differ from the values reported in the BioBERT paper. BioBERT uses a split of 14,690 / 3,856 / 3,856 (train / dev / test), whereas the data provided in this repository uses a split of 16,807 / 1,739 / 3,856. Both add up to 22,402 sentences with the same test set size, so only the train/dev boundary differs. Could you confirm this?

Thanks + regards,

Stefan

stefan-it, Mar 27 '19 17:03

Hey Stefan, thanks for your interest in the project. I'll look into the line counting issue and update the reported numbers. As for the dataset splits in JNLPBA, the difference may come from using a different source for the dataset files.
The JNLPBA dataset we used was pulled from https://github.com/cambridgeltl/MTL-Bioinformatics-2016/tree/master/data/JNLPBA, which gives files with the following line counts:

   49600 dev.txt
  105703 test.txt
  465497 train.txt

kyleclo, Mar 29 '19 14:03

@kyleclo, you say the JNLPBA dataset you used was pulled from https://github.com/cambridgeltl/MTL-Bioinformatics-2016/tree/master/data/JNLPBA. Then why do the tags in your processed data show B-Entity instead of, say, B-Disease?

shreyashub, Jun 19 '19 19:06

@shreyashub, I think you are talking about bc5cdr, not JNLPBA, because JNLPBA has no Disease category. For bc5cdr, we used a version that we had in s2 that dropped the entity types and combined bc5cdr-disease and bc5cdr-chem into one dataset. I agree it would have been better to use the original dataset.
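For illustration only, dropping entity types from BIO tags could look roughly like the sketch below. The helper name `collapse_entity_type` and the I-Chemical label are assumptions made for the example; this is not the preprocessing that was actually used for the s2 version of bc5cdr.

```python
def collapse_entity_type(tag, generic="Entity"):
    """Map a typed BIO tag (e.g. B-Disease) to an untyped one (B-Entity).

    Assumption: tags follow the PREFIX-TYPE pattern, with "O" for
    tokens outside any entity. Tags without a hyphen are returned as-is.
    """
    if tag == "O" or "-" not in tag:
        return tag
    # Split only on the first hyphen so hyphenated type names are handled.
    prefix, _entity_type = tag.split("-", 1)
    return f"{prefix}-{generic}"


# Hypothetical tag sequence mixing the two bc5cdr entity types.
typed = ["B-Disease", "I-Disease", "O", "B-Chemical", "I-Chemical"]
untyped = [collapse_entity_type(t) for t in typed]
```

After this mapping, disease and chemical mentions can no longer be told apart, which is why the processed data only shows B-Entity and I-Entity.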

ibeltagy, Jul 03 '19 19:07