I Can't Train a Model
Hi @kermitt2
Following your guide:
https://grobid.readthedocs.io/en/latest/training/Bibliographical-references/
I created a tei.xml file with almost 4k references; this is an example:
```xml
<listBibl><bibl>
<author>Oliveira, Fabio De and Pochmann, Marcio</author>
<title level="a">Entrevista: Marcio Pochmann</title>
<title level="s">Cadernos de Psicologia Social do Trabalho</title>
<date>2004</date>
<biblScope unit="page">81</biblScope>
<biblScope unit="volume">7</biblScope>
<biblScope unit="issue">0</biblScope>
<ptr type="web">http://dx.doi.org/10.11606/issn.1981-0490.v7i0p81-91</ptr>
</bibl></listBibl>
```
With this XML file I tried the "Train and evaluation in one command" option described at this URL:
https://grobid.readthedocs.io/en/latest/Training-the-models-of-Grobid/
following these steps:
1. Put my tei.xml file in this path:
`grobid/grobid-trainer/resources/dataset/reference-segmenter/corpus/tei`
2. Run this command from the root of the project:
`./gradlew train_reference_segmentation`
The process starts and finishes without errors or warnings, but I don't see any change in the behaviour of the reference_segmentation task...
Can you tell me what is wrong with my procedure?
Hello @rodyoukai!
In the documentation:

> This section describes how to annotate training data for the citation model. This model parses a bibliographical reference in isolation (as typically present in the bibliographical section at the end of an article).
The model to train is the `citation` model. `reference-segmenter` is the upstream model that produces the individual reference strings.
So the training data should go under `grobid/grobid-trainer/resources/dataset/citation/corpus/` and the training command is:

`./gradlew train_citation`
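
For reference, the corrected procedure can also be scripted. A minimal sketch in Python, where `my-references.tei.xml` is an illustrative filename (the path and gradle task come from the documentation above):

```python
# Minimal sketch: copy the annotated TEI file into the citation corpus
# and launch training. "my-references.tei.xml" is an assumed filename.
import shutil
import subprocess

shutil.copy(
    "my-references.tei.xml",
    "grobid/grobid-trainer/resources/dataset/citation/corpus/",
)
subprocess.run(["./gradlew", "train_citation"], cwd="grobid", check=True)
```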
I see, thanks a lot!!!
Hello, this is such an amazing project! Is the default version of Grobid (e.g. the one on the demo server) already trained with the data in your resource folder? Is there a smart way to avoid re-training it with data it has already been trained on (in addition to creating my own dataset)?
Thank you @ZhiliWang for the kind words!
The default CRF models included in this repo are the ones used on the demo server (there is no GPU for the DL models on the demo machine).
Apart from using/extending the existing training data or creating a dataset from scratch (as documented), another solution is to use existing publisher XML and the corresponding PDF generated from the XML source (or both derived from a common format). The idea is to align what is extracted from the PDF with the XML to get labeled data. This is done by most of the related tools, like CERMINE, Science Parse 2, LayoutLM, SelfDoc, VILA, ... See PubLayNet (PMC collection) or DocBank (via LaTeX arXiv sources) if you are interested in this approach.
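
To make the alignment idea concrete, here is a rough sketch of the approach (not the actual implementation of any of the tools above; the field names, strings, and threshold are illustrative). Gold fields parsed from the publisher XML are fuzzily located in the text extracted from the PDF, and the matched character spans can then be turned into token-level labels. Real implementations add tolerance for hyphenation, ligatures, and line breaks introduced by PDF extraction.

```python
# Rough sketch of XML/PDF alignment for generating labeled training data.
# Not GROBID's actual code; gold fields and pdf_text are illustrative.
from difflib import SequenceMatcher

def find_span(field_text, pdf_text, threshold=0.9):
    """Fuzzy-locate a gold XML field inside PDF-extracted text.

    Returns (start, end) character offsets in pdf_text, or None if the
    best contiguous match covers less than `threshold` of the field.
    """
    matcher = SequenceMatcher(None, field_text, pdf_text, autojunk=False)
    match = matcher.find_longest_match(0, len(field_text), 0, len(pdf_text))
    if match.size / max(1, len(field_text)) >= threshold:
        return match.b, match.b + match.size
    return None

# Gold fields as they would be parsed from the publisher XML (e.g. JATS).
gold = {"title": "Entrevista: Marcio Pochmann", "date": "2004"}
# Text as it would be extracted from the corresponding PDF.
pdf_text = "Oliveira, F. Entrevista: Marcio Pochmann. 2004"

labels = {field: find_span(value, pdf_text) for field, value in gold.items()}
print(labels)  # {'title': (13, 40), 'date': (42, 46)}
```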