
I Can't Train a Model

rodyoukai opened this issue on Aug 20, 2021 · 4 comments

Hi @kermitt2

Following your guide:

https://grobid.readthedocs.io/en/latest/training/Bibliographical-references/

I created a tei.xml file with almost 4k references; this is an example:

```xml
<listBibl>
<bibl>
<author>Oliveira, Fabio De and Pochmann, Marcio</author>
<title level="a">Entrevista: Marcio Pochmann</title>
<title level="s">Cadernos de Psicologia Social do Trabalho</title>
<date>2004</date>
<biblScope unit="page">81</biblScope>
<biblScope unit="volume">7</biblScope>
<biblScope unit="issue">0</biblScope>
<ptr type="web">http://dx.doi.org/10.11606/issn.1981-0490.v7i0p81-91</ptr>
</bibl>
</listBibl>
```

With this XML file I tried the "Train and evaluation in one command" procedure described at this URL:

https://grobid.readthedocs.io/en/latest/Training-the-models-of-Grobid/

Following these steps:

1. Put my tei.xml file in this path:

grobid/grobid-trainer/resources/dataset/reference-segmenter/corpus/tei

2. Run this command from the root of the project:

./gradlew train_reference_segmentation

The process starts and finishes without errors or warnings, but I don't see any change in the behaviour of the reference_segmentation task...

Can you tell me what is wrong with my procedure?

rodyoukai avatar Aug 20 '21 02:08 rodyoukai

Hello @rodyoukai !

In the documentation:

This section describes how to annotate training data for the citation model. This model parses a bibliographical reference in isolation (as typically present in the bibliographical section at the end of an article).

The model to train is the citation model.

reference-segmenter is the upstream model that produces the individual reference strings.

So the training data should go under grobid/grobid-trainer/resources/dataset/citation/corpus/

and the training command is:

./gradlew train_citation
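
For reference, here is a minimal sketch of the corrected workflow as a small script. The training file name `my-citations.training.tei.xml` and the checkout location are assumptions to adapt to your setup; the corpus path and Gradle task are the ones mentioned above.

```python
# Minimal sketch: place a citation training file in the citation corpus,
# then launch the Gradle training task from the project root.
import shutil
import subprocess
from pathlib import Path

grobid_root = Path("grobid")  # path to the GROBID checkout (assumption)
training_file = Path("my-citations.training.tei.xml")  # hypothetical file name

corpus_dir = grobid_root / "grobid-trainer" / "resources" / "dataset" / "citation" / "corpus"
shutil.copy(training_file, corpus_dir)

# Run the citation training task from the project root.
subprocess.run(["./gradlew", "train_citation"], cwd=grobid_root, check=True)
```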

kermitt2 avatar Aug 20 '21 02:08 kermitt2

I see, thanks a lot!!!

rodyoukai avatar Aug 20 '21 14:08 rodyoukai

Hello, this is such an amazing project! Is the default version of Grobid (e.g. the one on the demo server) already trained with the data from your resource folder? Is there a smart way to avoid re-training it on data it has already been trained with (in addition to creating my own dataset)?

ZhiliWang avatar Aug 24 '21 18:08 ZhiliWang

Thank you @ZhiliWang for the kind words!

The default CRF models included in this repo are the ones used on the demo server (there is no GPU on the demo machine for the DL models).

Apart from using/extending the existing training data or creating a dataset from scratch (as documented), another solution is to use existing publisher XML and the corresponding PDF generated from the XML source (or both derived from a common format). The idea is to align what is extracted from the PDF with the XML to get labeled data. This is done by most of the related tools, like CERMINE, Science Parse 2, LayoutLM, SelfDoc, VILA, ... See PubLayNet (PMC collection) or DocBank (via LaTeX arXiv sources) if you are interested in this approach.
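
To make the alignment idea concrete, here is a rough sketch (not GROBID's or the cited tools' actual implementation; the field names, sample data, and similarity threshold are illustrative assumptions): reference strings extracted from the PDF are matched against the structured entries from the publisher XML, so that the XML labels can be transferred onto the extracted text.

```python
# Rough sketch of PDF/XML alignment for generating labeled training data.
# Field names, sample data, and the 0.8 threshold are illustrative assumptions.
from difflib import SequenceMatcher

# Reference strings as extracted from the PDF.
pdf_references = [
    "Oliveira, Fabio De and Pochmann, Marcio. Entrevista: Marcio Pochmann. "
    "Cadernos de Psicologia Social do Trabalho, 7(0):81, 2004.",
]

# Structured entries parsed from the publisher XML for the same article.
xml_entries = [
    {
        "author": "Oliveira, Fabio De and Pochmann, Marcio",
        "title": "Entrevista: Marcio Pochmann",
        "journal": "Cadernos de Psicologia Social do Trabalho",
        "date": "2004",
    },
]

def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] based on matching subsequences."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def flatten(entry) -> str:
    """Render a structured XML entry as a plain string for comparison."""
    return " ".join(entry.values())

def align(pdf_refs, entries, threshold=0.8):
    """Pair each extracted reference string with its best-matching XML entry."""
    pairs = []
    for ref in pdf_refs:
        best = max(entries, key=lambda e: similarity(ref, flatten(e)))
        if similarity(ref, flatten(best)) >= threshold:
            # The labeled fields of `best` can now annotate the text of `ref`.
            pairs.append((ref, best))
    return pairs

for ref, entry in align(pdf_references, xml_entries):
    print(entry["title"], "->", ref[:60] + "...")
```

Real pipelines typically align at a finer granularity (tokens with layout coordinates) and with more robust matching; the sketch only illustrates the pairing step that produces labeled data from the two sources.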

kermitt2 avatar Aug 25 '21 07:08 kermitt2