grobid-dictionaries

createAnnotatedTrainingDictionaryBodySegmentation generates linebreak in the resulting TEI file

Open PonteIneptique opened this issue 5 years ago • 22 comments

Hi there !

Unlike the functions without Annotated in their name, which do not insert \n characters, the createAnnotatedTraining-prefixed functions seem to add line breaks to the output (version 0.5.4, pulled this morning).

createTraining.rawtxt.txt createTraining.xml.txt createAnnotatedTrainign.rawtxt.txt createAnnotatedTrainign.xml.txt

createTraining:

<lb/>que forma ou du moins species comprendrait la couleur, <lb/>la grandeur et autres détails. <lb/>Fimos, v. Lutum. <lb/>Findere. Scindere. Findere, diviser un corps dans le <lb/>sens de ses joints naturels, le décomposer pour ainsi dire <lb/>en ses parties élémentaires, comme fendre et cliver; scin¬ <lb/>dere, le diviser par force sans aucun égard aux joints et le <lb/>mettre en pièces, comme couper et déchirer. Findere <lb/>lignum veut dire fendre une bûche de bois en s'aidant de <lb/>la nature môme du bois, dans le sens de la longueur; mais <lb/>scindere, casser par pure force, en largeur. Le findens <lb/>oequor nave considère la mer comme un assemblage de <lb/>parties liquides; le scindens, comme n'ayant fait qu'un <lb/>tout dès l'origine.

createAnnotated:

<lb/>que forma ou du moins species comprendrait la couleur, <lb/>la grandeur et autres détails</entry>
. <lb/><entry>Fimos, v. Lutum</entry>
. <lb/><entry>Findere. Scindere. Findere, diviser un corps dans le <lb/>sens de ses joints naturels, le décomposer pour ainsi dire <lb/>en ses parties élémentaires, comme fendre et cliver; scin¬ <lb/>dere, le diviser par force sans aucun égard aux joints et le <lb/>mettre en pièces, comme couper et déchirer. Findere <lb/>lignum veut dire fendre une bûche de bois en s'aidant de <lb/>la nature môme du bois, dans le sens de la longueur; mais <lb/>scindere, casser par pure force, en largeur. Le findens <lb/>oequor nave considère la mer comme un assemblage de <lb/>parties liquides; le scindens, comme n'ayant fait qu'un <lb/>tout dès l'origine</entry>
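
One way to check that the two outputs differ only by annotation tags and literal newlines is sketched below. It is a minimal sketch, not part of grobid-dictionaries: the helper name `flatten` is hypothetical, and the two strings are shortened excerpts of the samples above.

```python
import re


def flatten(xml_text: str) -> str:
    """Drop literal newlines and every tag except <lb/> (hypothetical helper)."""
    no_newlines = re.sub(r"[\r\n]+", "", xml_text)
    # The negative lookahead keeps <lb/> and removes any other tag.
    return re.sub(r"<(?!lb/>)[^>]+>", "", no_newlines)


# Shortened excerpts of the createTraining and createAnnotated samples.
plain = "détails. <lb/>Fimos, v. Lutum. <lb/>Findere."
annotated = (
    "détails</entry>\n"
    ". <lb/><entry>Fimos, v. Lutum</entry>\n"
    ". <lb/><entry>Findere."
)

print(flatten(annotated) == plain)  # prints True
```

On these excerpts the comparison holds, so the only characters the annotated output adds outside of tags are the literal newlines.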

PonteIneptique avatar Nov 04 '19 13:11 PonteIneptique

Hi!

Thank you for pointing this out. It shouldn't be an issue for the training, though :)

MedKhem avatar Nov 05 '19 13:11 MedKhem

Hi! Well, the presence of these newlines actually made the results drop significantly. When I recreated the training data without Annotated, and therefore without newlines, the results got better, as expected. :)

PonteIneptique avatar Nov 05 '19 15:11 PonteIneptique

Could you please upload the data for both cases to a GitHub repo so I can have a look?

MedKhem avatar Nov 05 '19 17:11 MedKhem

Currently trying to reproduce the issue ;)

PonteIneptique avatar Nov 06 '19 07:11 PonteIneptique

Generated files: syn.zip

PonteIneptique avatar Nov 06 '19 08:11 PonteIneptique

Please check out the latest version and let me know if I can close this issue.

MedKhem avatar Nov 26 '19 18:11 MedKhem

It does not seem to fix the issue of line breaks being added:

java -jar /grobid/grobid-dictionaries/target/grobid-dictionaries-0.5.4-SNAPSHOT.one-jar.jar -dIn resources/150-152.pdf -dOut resources -exe createTrainingLexicalEntry
creates an XML file where all the content is on a single line (no \n is added):

<tei xml:space="preserve">
	<teiHeader>
		<fileDesc xml:id=""/>
	</teiHeader>
	<text>
		<body><entry>δέειν; vincire et nectere, synonymes de coercere, enchaîner <lb/>pour prévenir la liberté des mouvements, δεσμεύειν. <lb/>2. Ligare est le terme général ;

java -jar /grobid/grobid-dictionaries/target/grobid-dictionaries-0.5.4-SNAPSHOT.one-jar.jar -dIn resources/150-152.pdf -dOut resources -exe createAnnotatedTrainingLexicalEntry
currently creates a document like this, with \n newlines:

<entry><subEntry>δέειν; vincire et nectere, synonymes de coercere, enchaîner <lb/>pour prévenir la liberté des mouvements, δεσμεύειν</subEntry>
<pc>. <lb/>2. </pc>
<subEntry>Ligare est le terme général ; viere, le terme technique <lb/>à l'usage du tonnelier, du -vannier, etc</subEntry>
<pc>. <lb/>3. </pc>
<subEntry>Obligare, attacher par des prévenances; obstringere, <lb/>lier par des bienfaits; devincire, enchaîner à Loi par des <lb/>relations intimes et durables. Vobligatus se sent engagé <lb/>par les devoirs conventionnels de la vie du monde; Vob-<lb/>strictus, par des devoirs de morale ou de religion: le de¬ <lb/>vinctus, par des devoirs de piété. <lb/>Lima. Scobina. Lima, outil pour polir; scobina, pour <lb/>dégrossir</subEntry>
</entry><entry><lemma>Limes</lemma>
<pc>, v. </pc>
<variant>Finis</variant>
</entry><entry><lemma>Lmus</lemma>
<pc>, V. </pc>
<variant>Lutum</variant>
</entry><entry><lemma>Lingere</lemma>
<pc>, v. </pc>
<variant>Lambere</variant>

Before that :

  • I pulled
  • Removed old images
  • I retrained models, which led to this download before training:
[INFO] ------------------------------------------------------------------------
[INFO] Building grobid-dictionaries 0.5.4-SNAPSHOT
[INFO] ------------------------------------------------------------------------
Downloading: http://download.java.net/maven/2/org/grobid/grobid-core/0.5.6-SNAPSHOT/maven-metadata.xml
Downloading: http://download.java.net/maven/2/org/grobid/grobid-trainer/0.5.6-SNAPSHOT/maven-metadata.xml
  • docker images -a | head -n 2
REPOSITORY                    TAG                 IMAGE ID            CREATED             SIZE
medkhem/grobid-dictionaries   latest              d588b8160308        2 days ago          3.16GB

PonteIneptique avatar Nov 29 '19 10:11 PonteIneptique

OK, so this means the wrongly escaped line breaks are fixed ;) Can you send me the corresponding PDF?

MedKhem avatar Nov 29 '19 11:11 MedKhem

AFAIK, yes, but that fix was not for this issue then? 150-152.pdf

PonteIneptique avatar Nov 29 '19 11:11 PonteIneptique

Yeah. The fix was for issue #24. Sorry for the confusion. Regarding this "issue", it's not actually an issue. The training files used for annotation are not supposed to be TEI-compliant documents. The line breaks are generated on purpose, to help the annotator visually follow the original document alongside the training XML files (the name "tei.xml" is a bit misleading, I agree). So the line breaks here reflect the layout of lines in the original PDF. They will be removed in the final TEI output, which is generated by the web application. I hope it's clearer now...

MedKhem avatar Nov 29 '19 12:11 MedKhem

No worries! I am not complaining about tags or TEI compliance here, but specifically about the \n characters added to the input, which generate noise for the training afterwards :) So this is an issue :/

PonteIneptique avatar Nov 29 '19 12:11 PonteIneptique

This shouldn't introduce noise into the training. Have you annotated the same files in both modes (pre-annotated and manually annotated) and noticed a difference in the evaluation?

MedKhem avatar Nov 29 '19 12:11 MedKhem

Hey! Joining the discussion. I agree with @PonteIneptique: these newlines are highly problematic and cause major problems when it comes to recognizing entries.

[Screenshot: Screen Shot 2019-09-22 at 17 58 36]

gabays avatar Nov 29 '19 13:11 gabays

@gabays same question: do you have the same data annotated in the two modes, so I could use it to reproduce the problem?

MedKhem avatar Nov 29 '19 13:11 MedKhem

Pardon me, but adding \n characters to the text would introduce a new character into the character n-grams, so it would indeed create noise, right? The file annotated normally, without \n, works completely fine, while the others make the results drop in training. I don't think I tried it in evaluation.
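
The concern can be illustrated with a minimal sketch. It assumes a model that builds character n-grams over the raw text, newlines included, and nothing in this thread establishes that GROBID's feature extraction works this way. The function `char_ngrams` and both sample strings are made up for the illustration.

```python
def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of length n found in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


# The same entry boundary, with and without the literal newline.
without_newline = "</entry>. <lb/><entry>"
with_newline = "</entry>\n. <lb/><entry>"

# N-grams that exist only because of the added newline.
extra = char_ngrams(with_newline) - char_ngrams(without_newline)
print(sorted(extra))
```

Under that assumption, the single added newline contributes three trigrams that never occur in the newline-free version.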

PonteIneptique avatar Nov 29 '19 13:11 PonteIneptique

@MedKhem there should be some here: https://zenodo.org/record/3383658#.XeEZ4L8o-fQ

gabays avatar Nov 29 '19 13:11 gabays

@PonteIneptique @gabays The way GROBID is designed, line breaks are not used as characters in the training. Only the text is used for the training, and the line status is extracted in a separate step. I could fix the "\n" thing, but I first want to make sure that it really is an issue. May I ask you to annotate one or two pages in both modes and let me know whether the evaluation numbers differ?
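
The idea described here can be pictured with the following sketch. It is an illustration only and not GROBID's actual code: the function name, the regular expression, and the labels `LINESTART` and `LINEIN` are assumptions, and the input is assumed to contain no tags other than `<lb/>`.

```python
import re


def tokens_with_line_status(text: str) -> list[tuple[str, str]]:
    """Pair each token with a line status derived only from <lb/> markers."""
    result = []
    at_line_start = True
    # Whitespace, including any literal newline, only separates tokens.
    for piece in re.findall(r"<lb/>|[^\s<]+", text):
        if piece == "<lb/>":
            at_line_start = True
            continue
        result.append((piece, "LINESTART" if at_line_start else "LINEIN"))
        at_line_start = False
    return result


same_line_file = "la couleur, <lb/>la grandeur"
extra_newline_file = "la couleur,\n<lb/>la grandeur"

print(tokens_with_line_status(same_line_file)
      == tokens_with_line_status(extra_newline_file))  # prints True
```

In a pipeline of this shape, a literal newline between tokens leaves both the tokens and their line status unchanged, which is the behaviour being claimed.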

MedKhem avatar Nov 29 '19 13:11 MedKhem

@MedKhem Sure. Just to confirm: does this mean that adding random newlines would not cause any issue?

PonteIneptique avatar Nov 29 '19 13:11 PonteIneptique

It depends on where you add them :) At the lexical entry level, adding a newline between elements of the lexical entry (e.g. <sense>, <lemma>, ...) is fine. But if you add one between the tokens inside an element of the lexical entry (e.g. <lemma>..</lemma>), it will make a mess.
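
A small check for the problematic case could look like the sketch below. It assumes the training file is well-formed XML without namespaces. The function name `newlines_inside_fields` and the default tag list are hypothetical, and the sample is made up in the style of the output shown earlier in the thread.

```python
import xml.etree.ElementTree as ET


def newlines_inside_fields(xml_text, field_tags=("lemma", "variant", "pc")):
    """List (tag, content) for field elements whose content holds a newline."""
    root = ET.fromstring(xml_text)
    flagged = []
    for elem in root.iter():
        if elem.tag in field_tags:
            # itertext() covers the element's own content, not its tail,
            # so newlines *between* elements are ignored.
            content = "".join(elem.itertext())
            if "\n" in content:
                flagged.append((elem.tag, content))
    return flagged


sample = (
    "<entry><lemma>Limes</lemma>\n"
    "<pc>, v. </pc>\n"
    "<variant>Fi\nnis</variant>\n"
    "</entry>"
)

print(newlines_inside_fields(sample))  # prints [('variant', 'Fi\nnis')]
```

Only the newline inside `<variant>` is reported; the newlines separating `<lemma>`, `<pc>` and `<variant>` are not flagged.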

MedKhem avatar Nov 29 '19 13:11 MedKhem

Sure. You do not use carriage returns as a feature? So that is why it is hard for GROBID to recognize catalogue/dictionary entries, the primary feature of which is the carriage return…

gabays avatar Nov 29 '19 13:11 gabays

No, I do use it :) But as I told you, this is done at an earlier stage. We cannot draw general conclusions about a model's performance or the usefulness of its features without first studying the OCR quality of the documents used in the experiments. If the typographic information is inconsistent as a result of the OCR (e.g. bold tokens are not recognised properly), using the line status won't solve the problem for the DictionaryBodySegmentation model. In such cases more features would have to be added, but they would be ad hoc and might not work for other documents.

MedKhem avatar Nov 29 '19 13:11 MedKhem

I think I'm starting to understand. The issue, then, is that if the createAnnotated output is not perfect, which may well be the case, you might end up correcting the <lemma>, and the corrected span might then include a carriage return/line break, which would cause problems for training :)

PonteIneptique avatar Nov 29 '19 13:11 PonteIneptique