grobid-dictionaries
createAnnotatedTrainingDictionaryBodySegmentation generates line breaks in the resulting TEI file
Hi there !
Unlike the other functions without Annotated, which do not create \n characters, the functions prefixed with createAnnotatedTraining seem to add line breaks (version 0.5.4, pulled this morning).
createTraining.rawtxt.txt createTraining.xml.txt createAnnotatedTrainign.rawtxt.txt createAnnotatedTrainign.xml.txt
createTraining:
<lb/>que forma ou du moins species comprendrait la couleur, <lb/>la grandeur et autres détails. <lb/>Fimos, v. Lutum. <lb/>Findere. Scindere. Findere, diviser un corps dans le <lb/>sens de ses joints naturels, le décomposer pour ainsi dire <lb/>en ses parties élémentaires, comme fendre et cliver; scin¬ <lb/>dere, le diviser par force sans aucun égard aux joints et le <lb/>mettre en pièces, comme couper et déchirer. Findere <lb/>lignum veut dire fendre une bûche de bois en s'aidant de <lb/>la nature môme du bois, dans le sens de la longueur; mais <lb/>scindere, casser par pure force, en largeur. Le findens <lb/>oequor nave considère la mer comme un assemblage de <lb/>parties liquides; le scindens, comme n'ayant fait qu'un <lb/>tout dès l'origine.
createAnnotated:
<lb/>que forma ou du moins species comprendrait la couleur, <lb/>la grandeur et autres détails</entry>
. <lb/><entry>Fimos, v. Lutum</entry>
. <lb/><entry>Findere. Scindere. Findere, diviser un corps dans le <lb/>sens de ses joints naturels, le décomposer pour ainsi dire <lb/>en ses parties élémentaires, comme fendre et cliver; scin¬ <lb/>dere, le diviser par force sans aucun égard aux joints et le <lb/>mettre en pièces, comme couper et déchirer. Findere <lb/>lignum veut dire fendre une bûche de bois en s'aidant de <lb/>la nature môme du bois, dans le sens de la longueur; mais <lb/>scindere, casser par pure force, en largeur. Le findens <lb/>oequor nave considère la mer comme un assemblage de <lb/>parties liquides; le scindens, comme n'ayant fait qu'un <lb/>tout dès l'origine</entry>
Hi!
thank you for pointing this out. But it shouldn't represent an issue for the training :)
Hi !
Well, the presence of these newlines actually made the results drop significantly. When I recreated the training data without Annotated, and therefore without newlines, the results got better, as expected. :)
Could you please upload the data for both cases to the GitHub repo so I can have a look?
Currently trying to reproduce the issue ;)
Generated files: syn.zip
Please check out the latest version and let me know if I can close this issue
It seems it does not fix the problem of line breaks being added:
java -jar /grobid/grobid-dictionaries/target/grobid-dictionaries-0.5.4-SNAPSHOT.one-jar.jar -dIn resources/150-152.pdf -dOut resources -exe createTrainingLexicalEntry
creates an XML file where all the content is on a single line (no \n are added)
<tei xml:space="preserve">
<teiHeader>
<fileDesc xml:id=""/>
</teiHeader>
<text>
<body><entry>δέειν; vincire et nectere, synonymes de coercere, enchaîner <lb/>pour prévenir la liberté des mouvements, δεσμεύειν. <lb/>2. Ligare est le terme général ;
java -jar /grobid/grobid-dictionaries/target/grobid-dictionaries-0.5.4-SNAPSHOT.one-jar.jar -dIn resources/150-152.pdf -dOut resources -exe createAnnotatedTrainingLexicalEntry
currently creates a document like this, with newlines (\n):
<entry><subEntry>δέειν; vincire et nectere, synonymes de coercere, enchaîner <lb/>pour prévenir la liberté des mouvements, δεσμεύειν</subEntry>
<pc>. <lb/>2. </pc>
<subEntry>Ligare est le terme général ; viere, le terme technique <lb/>à l'usage du tonnelier, du -vannier, etc</subEntry>
<pc>. <lb/>3. </pc>
<subEntry>Obligare, attacher par des prévenances; obstringere, <lb/>lier par des bienfaits; devincire, enchaîner à Loi par des <lb/>relations intimes et durables. Vobligatus se sent engagé <lb/>par les devoirs conventionnels de la vie du monde; Vob-<lb/>strictus, par des devoirs de morale ou de religion: le de¬ <lb/>vinctus, par des devoirs de piété. <lb/>Lima. Scobina. Lima, outil pour polir; scobina, pour <lb/>dégrossir</subEntry>
</entry><entry><lemma>Limes</lemma>
<pc>, v. </pc>
<variant>Finis</variant>
</entry><entry><lemma>Lmus</lemma>
<pc>, V. </pc>
<variant>Lutum</variant>
</entry><entry><lemma>Lingere</lemma>
<pc>, v. </pc>
<variant>Lambere</variant>
Before that :
- I pulled
- Removed old images
- I retrained models, which led to this download before training:
[INFO] ------------------------------------------------------------------------
[INFO] Building grobid-dictionaries 0.5.4-SNAPSHOT
[INFO] ------------------------------------------------------------------------
Downloading: http://download.java.net/maven/2/org/grobid/grobid-core/0.5.6-SNAPSHOT/maven-metadata.xml
Downloading: http://download.java.net/maven/2/org/grobid/grobid-trainer/0.5.6-SNAPSHOT/maven-metadata.xml
- docker images -a | head -n 2
REPOSITORY TAG IMAGE ID CREATED SIZE
medkhem/grobid-dictionaries latest d588b8160308 2 days ago 3.16GB
OK. So this means the wrongly escaped line breaks are fixed ;) Can you send me the corresponding PDF?
AFAIK, yes, but that was not for this issue then? 150-152.pdf
Yeah. The fix was for issue #24. Sorry for the confusion. Regarding this "issue", it's not actually an issue. In fact, for the annotation, the training files are not supposed to be TEI-compliant documents. The line breaks are generated here on purpose, to help the annotator visually follow the original document and the training XML files. The name "tei.xml" is a bit misleading, I agree. So the line breaks here reflect the layout of lines in the original PDF. They will be removed in the final TEI output, which is generated by the web application. I hope it's clearer now...
No worries !
I am not complaining about tags or TEI compliance here, but specifically about the \n added to the input, which generate noise for the training afterwards :) So this is an issue :/
This shouldn't introduce noise in the training. Have you annotated the same files in both modes (pre-annotated and manually annotated) and noticed a difference in the evaluation?
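For anyone who wants to compare the two modes on otherwise identical text, here is a minimal sketch of a post-processing step that removes the literal newlines from an annotated training file while keeping the <lb/> tags. It is not part of grobid-dictionaries: the function name strip_body_newlines is hypothetical, and it assumes the content to clean sits inside a single <body>...</body> element, as in the samples above.

```python
import re


def strip_body_newlines(xml_text: str) -> str:
    """Remove literal CR/LF characters found inside <body>...</body>.

    <lb/> tags are left untouched; only the raw newline characters go away.
    Hypothetical helper, not a grobid-dictionaries function.
    """
    def _clean(match: re.Match) -> str:
        # Drop every carriage return / line feed within the matched body.
        return re.sub(r"[\r\n]+", "", match.group(0))

    # DOTALL lets "." span the newlines we are about to remove.
    return re.sub(r"<body>.*?</body>", _clean, xml_text, flags=re.DOTALL)


if __name__ == "__main__":
    sample = (
        "<text>\n"
        "<body><entry>la grandeur et autres détails</entry>\n"
        ". <lb/><entry>Fimos, v. Lutum</entry>\n"
        "</body>\n"
        "</text>"
    )
    print(strip_body_newlines(sample))
```

Newlines outside <body> are kept, so the header layout of the file does not change.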
Hey! I'm joining the discussion. I agree with @PonteIneptique: these newlines are highly problematic and create big problems when it comes to recognizing entries.

@gabays same question: do you have the same data annotated in the two modes, so I could use it to reproduce the problem?
Pardon me, but having \n characters added to the text would create a new character for the character n-grams, so it would indeed create noise, right?
The file annotated normally, without \n, works completely fine, while the others score lower, or pull the scores down, in training. I don't think I tried in eval.
@MedKhem there should be some here: https://zenodo.org/record/3383658#.XeEZ4L8o-fQ
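To make the concern concrete, here is a small sketch of what happens to character n-grams when a \n is inserted into a string. It only illustrates the general point about n-grams. It does not reproduce how GROBID builds its features, and the helper name char_ngrams is hypothetical.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Return the set of character n-grams of `text` (trigrams by default)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


plain = "Lutum. Findere"
with_newline = "Lutum\n. Findere"  # same text with one inserted newline

# N-grams that exist only because of the inserted newline.
extra = char_ngrams(with_newline) - char_ngrams(plain)

if __name__ == "__main__":
    for gram in sorted(extra):
        print(repr(gram))
```

Every n-gram in extra contains the newline character, so a model that consumed raw characters would see n-grams that never occur in the single-line version of the text.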
@PonteIneptique @gabays The way GROBID is designed, the line breaks are not used as characters in the training. Only the text is used for the training, and the line status is extracted in another step. Now, I could fix the "\n" thing, but I want to make sure that it really is an issue. May I ask you to annotate one or two pages in both modes and let me know if there is a difference, based on the evaluation numbers?
@MedKhem Sure. Just to confirm: does this mean that adding random newlines would not create any issue?
It depends on where you add them :) At the lexical entry level, if you add a newline between elements of the lexical entry (e.g. <sense>, <lemma>, ...), that's fine. But if you add it between tokens of an element of the lexical entry (e.g. <lemma>..</lemma>), then there will be a mess.
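A minimal sketch of how one could check a training fragment for this case, i.e. newlines that fall inside an element rather than between elements. This is not a grobid-dictionaries tool: the function name elements_with_inner_newlines is hypothetical, and it assumes the fragment is well-formed XML and that <lb/> is the only inline tag allowed inside a leaf element.

```python
import xml.etree.ElementTree as ET


def elements_with_inner_newlines(fragment: str) -> list:
    """List tags of leaf elements whose own text contains a raw newline.

    A "leaf" here is an element with no children other than <lb/>.
    Newlines between sibling elements are tails, so they are not flagged.
    """
    root = ET.fromstring(f"<root>{fragment}</root>")
    flagged = []
    for elem in root.iter():
        if elem is root or elem.tag == "lb":
            continue
        # Skip containers such as <entry>, which hold other elements.
        if any(child.tag != "lb" for child in elem):
            continue
        inner = "".join(elem.itertext())  # text inside elem, excluding its tail
        if "\n" in inner or "\r" in inner:
            flagged.append(elem.tag)
    return flagged


if __name__ == "__main__":
    between = "<entry><lemma>Limes</lemma>\n<pc>, v. </pc>\n<variant>Finis</variant>\n</entry>"
    inside = "<entry><lemma>Li\nmes</lemma>\n<pc>, v. </pc></entry>"
    print(elements_with_inner_newlines(between))
    print(elements_with_inner_newlines(inside))
```

The first fragment mirrors the createAnnotatedTrainingLexicalEntry output quoted earlier, where the newlines sit between elements. The second has a newline between the tokens of <lemma>, which is the situation described as problematic.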
Sure. You do not use carriage returns as a feature? So that is why it is hard for GROBID to recognize catalogue/dictionary entries, the primary feature of which is the carriage return…
No, I do use it :) But as I told you, this is done in a previous stage. We cannot draw a general conclusion about the performance of a model and the usefulness of features without first studying the OCR quality of the documents we experimented with. If the typographic information is not consistent as a result of the OCR (e.g. bold tokens are not recognised properly), using the line status wouldn't solve the problem in the case of the DictionaryBodySegmentation model. In such cases, more features should be added, but they will be ad hoc and might not work for other documents.
I think I'm starting to understand. The issue, then, is that if the createAnnotated output is not perfect, which may well happen, you might end up correcting the <lemma>, and the corrected element might then include a carriage return/line break, which would cause issues for training :)