
about how to get the training and test datasets

Open majiajun0 opened this issue 1 year ago • 1 comments

Hi! I know I can get test datasets for end-to-end evaluation, but I am still confused about how to get training and test datasets for segmentation training, header training, citation training, and so on. Currently I take PDFs and create training data with GROBID. Then I fix the wrongly labeled tei.xml files and put them in the training datasets.
It takes a lot of time to correct the wrongly labeled files. I wonder whether this is the right way to obtain datasets. Is there a better way?

majiajun0, Nov 28 '22 03:11

Hi @majiajun0 !

This is the current standard way. Unfortunately yes, correcting labeled data is quite slow and painful.

To try to speed up the process, I first label a core set of examples, retrain the model, and regenerate training data for a large set of documents. I then select the failing documents for correction, instead of correcting lots of documents at random, many of which are already correct and therefore less useful for training. I select the failing documents with scripts that look at the predicted results, for instance checking whether no title or no authors were detected for the header model.
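A selection script of this kind could look roughly like the sketch below. It is only an illustration, not the script actually used: the tag names `titlePart` and `docAuthor`, the `*.tei.xml` file pattern, and the helper names `missing_fields` and `select_failing` are assumptions, so adjust them to the files your model actually produces.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# Assumed tag names for the title and author annotations in generated
# header training files; change them to match your own data.
REQUIRED_TAGS = ("titlePart", "docAuthor")


def local_name(tag):
    """Drop any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def missing_fields(xml_data, required=REQUIRED_TAGS):
    """Return the required tags that are absent or contain no text."""
    root = ET.fromstring(xml_data)
    found = set()
    for element in root.iter():
        name = local_name(element.tag)
        if name in required and "".join(element.itertext()).strip():
            found.add(name)
    return [tag for tag in required if tag not in found]


def select_failing(directory, pattern="*.tei.xml"):
    """Map each file name to its missing fields, skipping complete files."""
    failing = {}
    for path in sorted(Path(directory).glob(pattern)):
        try:
            missing = missing_fields(path.read_bytes())
        except ET.ParseError:
            # A file that does not parse is also a candidate for review.
            missing = ["<unparseable>"]
        if missing:
            failing[path.name] = missing
    return failing
```

Calling `select_failing("path/to/generated/files")` on a hypothetical directory of generated files returns only the documents worth opening for manual correction, along with the reason each one was flagged.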

The other typical approach is to generate aligned XML and PDF document pairs, for example via PMC or arXiv. There is nothing to label manually and it can leverage a large number of documents, but I found that it does not work very well and leads to an enormous amount of redundant training data, often excluding the complicated and useful cases (those that failed to align automatically!); see https://grobid.readthedocs.io/en/latest/Principles/#training-data-qualitat-statt-quantitat
So it creates other issues (coverage, bias/lack of domain portability), which are hard to solve.

kermitt2, Nov 28 '22 11:11