Hyphen at line break removed
In the first pubmed evaluation manuscript, a number of times 'α2-integrin' is at a line break, e.g.: "was mediated through the inhibition of expression of α2- integrin (1,2). Integrins are receptors that mediate attachment"
In the output it becomes: "was mediated through the inhibition of expression of α 2integrin..." (The space is another issue https://github.com/kermitt2/grobid/issues/179)
In some cases it may be desirable to remove the hyphen. Not in this case. Probably never when there is a number?
Actually there is another, less clear-cut word hyphenation example: "Mean PK parameters CL, V, and F were calculated by non- compartmental analysis. The tumor growth experiments in"
Becomes: "Mean PK parameters CL, V, and F were calculated by noncompartmental analysis...."
In the nxml file it is annotated as 'non-compartmental'. I believe both versions are valid. So I would probably not treat that as a bug. (But I thought it was worth mentioning anyhow)
Thanks for the issue!
Dehyphenization is tricky ;). I was aware of these issues, but it's very useful to have a dedicated issue for that and discuss.
What is implemented so far is very simplistic: it does not require dictionaries or resources, but it produces the errors you are mentioning. Basically, when there is a hyphen at the end of a line, we always dehyphenize.
We could introduce additional rules to improve it, like checking whether there are numbers in the tokens before concatenating them, and/or having a language-specific list of prefixes (like anti-, non-, post-). We might always have some rare errors/exceptions. Adding rules manually is endless and not really in the spirit of GROBID...
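The rules sketched above (keep the hyphen next to digits, keep it after a known prefix, otherwise join) could look roughly like the following. This is a hypothetical sketch, not GROBID's actual code; the class, method, and prefix list are all illustrative:

```java
import java.util.Set;

// Hypothetical sketch of the rule-based heuristics discussed above:
// keep the hyphen when the half before it contains a digit, or when it
// is a known prefix; otherwise treat it as a soft line-break hyphen.
public class DehyphenSketch {
    // Illustrative prefix list; a real one would be language-specific.
    private static final Set<String> KEEP_HYPHEN_PREFIXES =
        Set.of("anti", "non", "post", "pre", "self", "semi");

    public static String joinAtLineBreak(String before, String after) {
        String stem = before.endsWith("-")
            ? before.substring(0, before.length() - 1) : before;
        // Rule 1: digits around the hyphen usually signal a real
        // compound (e.g. "α2-integrin"), so the hyphen is kept.
        if (stem.chars().anyMatch(Character::isDigit)) {
            return stem + "-" + after;
        }
        // Rule 2: common prefixes may legitimately keep their hyphen
        // (though, as noted above, "noncompartmental" is also valid).
        if (KEEP_HYPHEN_PREFIXES.contains(stem.toLowerCase())) {
            return stem + "-" + after;
        }
        // Default: a soft hyphen inserted by line breaking; join.
        return stem + after;
    }
}
```

As the thread notes, such a list can never be complete, which is the argument for a learned approach instead.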
Another approach could be to use a machine-learning-based text cleaner/normalizer, for instance a seq2seq model, but then the problem is having enough training data. The advantage would maybe be to tackle other text cleaning problems at the same time, like diacritic combinations, invalid spacing, etc.
Intuitively I was going to claim that it isn't common for words to be broken across lines, but scanning through the same PDF, the evidence shows it is actually quite common. One approach could also be to look at other examples within the same document. I checked a few examples (around 7); the only one I couldn't find a second occurrence of so far was 'NON-MEM'. If it were up to me, I would like to have the option to include an element for the hyphens at line boundaries; then I could do some post-processing.
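The "look at other examples within the same document" idea could be sketched as follows: prefer whichever variant (joined or hyphenated) is also attested away from a line break in the rest of the document text. This is a hypothetical illustration; the class and method names are not GROBID's:

```java
import java.util.regex.Pattern;

// Hypothetical sketch: resolve a line-break hyphen using evidence from
// elsewhere in the same document. Falls back to joining (the current
// default behaviour) when neither variant is better attested.
public class CorpusEvidence {
    public static String resolve(String firstHalf, String secondHalf,
                                 String documentText) {
        String stem = firstHalf.endsWith("-")
            ? firstHalf.substring(0, firstHalf.length() - 1) : firstHalf;
        String joined = stem + secondHalf;
        String hyphenated = stem + "-" + secondHalf;
        long joinedCount = countOccurrences(documentText, joined);
        long hyphenCount = countOccurrences(documentText, hyphenated);
        // Trust the variant seen more often in running text.
        return (hyphenCount > joinedCount) ? hyphenated : joined;
    }

    private static long countOccurrences(String text, String word) {
        return Pattern.compile(Pattern.quote(word))
                      .matcher(text).results().count();
    }
}
```

Words like 'NON-MEM' that occur only once in a document would still fall through to the default rule.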
Would you be happy to use the PMC dataset as training data?
Otherwise, that seems to be training data that should be fairly easy to extract from any PDF with XML / text data (at least in English). Maybe even without a PDF.
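The "maybe even without a PDF" idea can be illustrated by generating synthetic training pairs from plain text: artificially break long words at a line boundary, keeping the original form as the target. A rough sketch, with an arbitrary length threshold and break point chosen purely for illustration:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: produce (hyphenated-at-line-break, original)
// pairs from plain text, as training data for a learned dehyphenizer.
public class TrainingPairGenerator {
    public static List<Map.Entry<String, String>> pairs(String text) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String word : text.split("\\s+")) {
            // Only break reasonably long, purely alphabetic words;
            // the midpoint break is a simplification (real line
            // breaking follows hyphenation patterns).
            if (word.length() >= 8
                    && word.chars().allMatch(Character::isLetter)) {
                int cut = word.length() / 2;
                String broken = word.substring(0, cut) + "-\n"
                              + word.substring(cut);
                out.add(new SimpleEntry<>(broken, word));
            }
        }
        return out;
    }
}
```

Pairs extracted this way would only cover the "join" case; examples like α2-integrin, where the hyphen must be kept, would still need aligned PDF/XML data such as PMC.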
On 05/04/17 13:16, Daniel Ecer wrote:
> Intuitively I was going to claim that it isn't common that words are broken across lines. Scanning through the same PDF the evidence shows it is quite common actually.
That's one of the multiple differences between MS Word and LaTeX: by default, Word does not hyphenate words (and did not offer an option to do so before v2010), whereas LaTeX has always had an option to do so (at least manually; to my knowledge, dating back to the 90s). Since most of the scientific papers are written with LaTeX, you'll indeed encounter a lot of word hyphenation :)
-- Guillaume MULLER, PhD Presans c/o REMIX COWORKING - L'APPART 57 rue de Turbigo 75003 Paris France http://www.presans.com http://feeds.feedburner.com/OYI/fr
As a complement, here is a blog post from one of the founders of Authorea about the ratio of scientific papers written with LaTeX (~18% of all articles but, as expected, largely dominating in a couple of domains).
https://www.authorea.com/users/3/articles/107393-how-many-scholarly-articles-are-written-in-latex/_show_article
We are also running into this issue. Example PDF: http://ecp.acponline.org/sepoct01/kent.pdf
There are a lot of hyphens at the ends of lines in the text, breaking words: e.g. "modeling techniques" becomes "modeling tech- niques".
Is it possible to at least return the newlines, so we can modify the result ourselves based on some rules?
@borkdude thank you for reporting these errors. I think the best would be to have a more robust dehyphenization process (normally it doesn't work that badly...).
The problems with outputting the End Of Line (EOL) in the final TEI result are:
- sometimes EOL are pure garbage in some PDFs (like one word per line), and often there are many more EOL in the actual PDF stream than what we see, so it might not be so useful for post-processing (GROBID, on the other hand, has all the coordinate information in each token to improve dehyphenation)
- the principle of the TEI is to give the logical structure of the document, abstracting away from any presentation information. So this would require writing some sort of alternative debug TEI output, and everybody would want different information out for post-processing; it would be a pain to develop and maintain I think. So the best imho is to try to get the dehyphenization as good as possible in GROBID, because dehyphenation is really part of its job.
@lfoppiano hello Luca, would you have some time to look at these dehyphenization errors? My bad excuse: you're the last one who modified it :D :D
FYI I'm checking on the PDF kent.pdf:
- the text coming from the abstract (BiblioItem.getAbstract()) contains missing line breaks, which makes the dehyphenation fail (under the naive assumption that a hyphen + line break, whatever its form, = a dehyphenation point). See the raw text before the call to dehyphenize():
CONTEXT. A meta-analysis found that primary percutaneous transluminal coronary angioplasty (PTCA) was more effective than thrombolytic therapy in reducing mor-tality from acute myocardial infarction. However, fewer than 20% of U.S. hospitals have facilities to perform PTCA and many clinicians must choose between immedi-ate thrombolytic therapy and delayed PTCA. COUNT. The number of minutes of PTCA-related delay that would nullify its bene-fits. CALCULATION. For 10 published randomized trials, we calculated the following: PTCA-related delay = median "door-to-balloon" time -median "door-to-needle" time Survival benefit = 30-day mortality after thrombolytic therapy -30-day mortality after PTCA The relationship between delay and benefit was assessed with linear regression. RESULTS. The reported PTCA-related delay ranged from 7 to 59 minutes, while the absolute survival benefit ranged from -2.2% (favoring thrombolytic therapy) to 7.4% (favoring PTCA). Across trials, the survival benefit decreased as the PTCA-related delay increased: For each additional 10-minute delay, the benefit was predicted to decrease 1.7% (P< 0.001). Linear regression showed that at a PTCA-related delay of 50 minutes, PTCA and thrombolytic therapy yielded equivalent reductions in mor-tality . CONCLUSIONS. In clinical trials with short PTCA-related delays, PTCA produced better outcomes, while trials with longer delays favored thrombolytic therapy. A more precise estimate of the time interval to equipoise between the two therapies needs to be modeled with patient-level data. At experienced cardiac centers, PTCA is probably still preferable, even with delays longer than 50 minutes.
- In the fulltext the dehyphenation is not applied (it was removed, perhaps because it wasn't working so well?).
What I would do is:
- apply dehyphenation using LayoutToken and not text (a method taking text can always tokenize on the fly)
- review (and migrate using the Clusteror) how the abstract is extracted (perhaps this should be another task?), because at first glance it looks like some line breaks are lost
- yes, that's why in this case we apply another dehyphenization method (dehyphenizeHard()) which does not expect a line break. That explains why mor- tality is correctly dehyphenized as mortality in the abstract of the PDF example. The problem with the header model is that it is very old and does not work with LayoutToken. The best would be to update this complete model and bring it in line with the other models, but that's quite a lot of work.
- Ah, it's my mistake it seems, but I don't remember exactly why I removed it. Probably I wanted to have it explicitly called where it is relevant (even in the full text, in some fields, like formulas, we don't want to dehyphenize, but still normalize the text), via TextUtilities.dehyphenize() in the appropriate fields in TEIFormatter.java.
About the dehyphenation, the current method using LayoutToken is not working well. Dehyphenation using text is much better for the moment because it is more flexible with the spaces around the hyphen, which is why it was used. The method using LayoutToken should be reviewed/extended I think.
Be careful that dehyphenize must be called only in certain fields where we are sure to have only text; performing it at clusteror level does not seem the right moment, because we still don't know what exactly the type of the current labelled segment is.
> yes, that's why in this case we apply another dehyphenization method (dehyphenizeHard()) which does not expect a line break. It explains why mor- tality is correctly dehyphenized as mortality in the abstract of the PDF example. The problem with the header model is that it is very old and does not work with LayoutToken. The best would be to update this complete model and bring it in line with the other models, but that's quite a lot of work.

- the issue with mor- tality arises before the dehyphenation, since the \n is lost. That is the first reason for my suggestion to work directly on the layout tokens.
> Ah, it's my mistake it seems, but I don't remember exactly why I removed it. Probably I wanted to have it explicitly called where it is relevant (even in the full text, in some fields, like formulas, we don't want to dehyphenize, but still normalize the text), via TextUtilities.dehyphenize() in the appropriate fields in TEIFormatter.java.

- Yes, indeed it has to be applied only to text.
> About the dehyphenation, the current method using LayoutToken is not working well. Dehyphenation using text is much better for the moment because it is more flexible with the spaces around the hyphen, which is why it was used. The method using LayoutToken should be reviewed/extended I think.

- The current dehyphenation method using layout tokens is not complete. I would aim to merge the three methods and produce a single one using layout tokens, with the possibility of a more aggressive approach.
> Be careful that dehyphenize must be called only in certain fields where we are sure to have only text; performing it at clusteror level does not seem the right moment, because we still don't know what exactly the type of the current labelled segment is.

- The idea was to use the clusteror to extract, and apply the dehyphenation after the text is recomposed, not at the same moment.
OK I see, you were talking about the abstract for the clusteror. As I said, the old-fashioned Header model does not use LayoutToken for decoding the CRF results; it follows a different logic where the EOL are (voluntarily) not preserved. They were actually used to represent two discontinuous segments for the same field, for instance for keyword or author fields... hence the different dehyphenization method (which works fine in the kent.pdf example).
It would be necessary to entirely rewrite the method HeaderParser.resultExtraction() (using the clusteror for decoding CRF results) and pay attention to some other things in BiblioItem (there is a special hack to propagate LayoutToken for authors, in order to make bounding boxes for authors present in the TEI; we would need to find a way to generalize that, in order to keep the layout tokens for any field for creating corresponding bounding boxes).
For me it was a different task, issue #136, to have every aspect updated at the same time, which is why I mentioned that it is quite a lot of work (and also why it has been an open issue for a year and a half ;) ). Then the textual fields extracted from the header would be aligned with all the other models, and ready to use the common dehyphenization method.
OK, so I will focus on the dehyphenize() method using LayoutTokens, and we could have a version that takes text and tokenizes it under the hood. I will see whether to also merge in the aggressive version or not.
I've implemented something to fix the dehyphenation. I'm sure it will require a couple of iterations. Could someone test it, focusing only on the body (not the abstract)?
I've run the PubMed end-to-end evaluation.
======= Header metadata =======
Evaluation on 1942 random PDF files out of 1943 PDF (ratio 1.0).
======= Strict Matching ======= (exact matches)
===== Field-level results =====
label accuracy precision recall f1
abstract 82.02 15.67 14.45 15.04
authors 93.12 69.59 67.61 68.58
first_author 97.99 93.65 90.56 92.08
keywords 93.28 69.29 56.27 62.1
title 93.34 71.44 68.68 70.03
all fields 91.95 64.1 59.86 61.91 (micro average)
91.95 63.93 59.51 61.57 (macro average)
======== Soft Matching ======== (ignoring punctuation, case and space characters mismatches)
===== Field-level results =====
label accuracy precision recall f1
abstract 88.33 48.38 44.61 46.42
authors 93.22 70.06 68.08 69.06
first_author 98.07 94.03 90.92 92.45
keywords 94.08 75.8 61.57 67.95
title 94.66 77.92 74.91 76.39
all fields 93.67 73.34 68.49 70.83 (micro average)
93.67 73.24 68.02 70.45 (macro average)
==== Levenshtein Matching ===== (Minimum Levenshtein distance at 0.8)
===== Field-level results =====
label accuracy precision recall f1
abstract 94.67 81.26 74.92 77.96
authors 95.84 82.75 80.4 81.56
first_author 98.08 94.08 90.97 92.5
keywords 95.7 89.02 72.3 79.79
title 96.11 84.99 81.71 83.32
all fields 96.08 86.26 80.56 83.31 (micro average)
96.08 86.42 80.06 83.03 (macro average)
= Ratcliff/Obershelp Matching = (Minimum Ratcliff/Obershelp similarity at 0.95)
===== Field-level results =====
label accuracy precision recall f1
abstract 93.63 75.87 69.95 72.79
authors 94.05 74.1 72 73.03
first_author 97.99 93.65 90.56 92.08
keywords 95.15 84.46 68.6 75.71
title 95.3 81.03 77.9 79.43
all fields 95.22 81.66 76.26 78.87 (micro average)
95.22 81.82 75.8 78.61 (macro average)
===== Instance-level results =====
Total expected instances: 1941
Total correct instances: 146 (strict)
Total correct instances: 437 (soft)
Total correct instances: 874 (Levenshtein)
Total correct instances: 710 (ObservedRatcliffObershelp)
Instance-level recall: 7.52 (strict)
Instance-level recall: 22.51 (soft)
Instance-level recall: 45.03 (Levenshtein)
Instance-level recall: 36.58 (RatcliffObershelp)
======= Citation metadata =======
Evaluation on 1942 random PDF files out of 1943 PDF (ratio 1.0).
======= Strict Matching ======= (exact matches)
===== Field-level results =====
label accuracy precision recall f1
authors 97.24 81.29 70.09 75.27
date 98.82 92.2 77.57 84.25
first_author 98.29 88.91 76.55 82.27
inTitle 95.86 71.08 66.85 68.9
issue 99.37 83.63 77.86 80.64
page 98.47 92.11 78.65 84.85
title 96.79 77.43 69.43 73.21
volume 99.05 94.81 82.16 88.03
all fields 97.98 85.15 74.56 79.51 (micro average)
97.98 85.18 74.89 79.68 (macro average)
======== Soft Matching ======== (ignoring punctuation, case and space characters mismatches)
===== Field-level results =====
label accuracy precision recall f1
authors 97.32 81.85 70.58 75.8
date 98.82 92.2 77.57 84.25
first_author 98.3 89.03 76.65 82.38
inTitle 97.37 81.62 76.76 79.12
issue 99.37 83.63 77.86 80.64
page 98.47 92.11 78.65 84.85
title 98.31 88.56 79.41 83.74
volume 99.05 94.81 82.16 88.03
all fields 98.37 88.33 77.34 82.47 (micro average)
98.37 87.98 77.45 82.35 (macro average)
==== Levenshtein Matching ===== (Minimum Levenshtein distance at 0.8)
===== Field-level results =====
label accuracy precision recall f1
authors 98.09 87.39 75.35 80.93
date 98.82 92.2 77.57 84.25
first_author 98.33 89.21 76.8 82.54
inTitle 97.52 82.68 77.76 80.15
issue 99.37 83.63 77.86 80.64
page 98.47 92.11 78.65 84.85
title 98.66 91.16 81.73 86.19
volume 99.05 94.81 82.16 88.03
all fields 98.54 89.65 78.5 83.7 (micro average)
98.54 89.15 78.49 83.45 (macro average)
= Ratcliff/Obershelp Matching = (Minimum Ratcliff/Obershelp similarity at 0.95)
===== Field-level results =====
label accuracy precision recall f1
authors 97.65 84.27 72.66 78.04
date 98.82 92.2 77.57 84.25
first_author 98.29 88.93 76.56 82.28
inTitle 97.17 80.23 75.46 77.77
issue 99.37 83.63 77.86 80.64
page 98.47 92.11 78.65 84.85
title 98.54 90.26 80.93 85.34
volume 99.05 94.81 82.16 88.03
all fields 98.42 88.69 77.66 82.81 (micro average)
98.42 88.3 77.73 82.65 (macro average)
===== Instance-level results =====
Total expected instances: 89789
Total extracted instances: 86507
Total correct instances: 35610 (strict)
Total correct instances: 46360 (soft)
Total correct instances: 50433 (Levenshtein)
Total correct instances: 47359 (RatcliffObershelp)
Instance-level precision: 41.16 (strict)
Instance-level precision: 53.59 (soft)
Instance-level precision: 58.3 (Levenshtein)
Instance-level precision: 54.75 (RatcliffObershelp)
Instance-level recall: 39.66 (strict)
Instance-level recall: 51.63 (soft)
Instance-level recall: 56.17 (Levenshtein)
Instance-level recall: 52.74 (RatcliffObershelp)
Instance-level f-score: 40.4 (strict)
Instance-level f-score: 52.59 (soft)
Instance-level f-score: 57.21 (Levenshtein)
Instance-level f-score: 53.73 (RatcliffObershelp)
Matching 1 : 62566
Matching 2 : 3384
Matching 3 : 2786
Matching 4 : 665
Total matches : 69401
======= Fulltext structures =======
Evaluation on 1942 random PDF files out of 1943 PDF (ratio 1.0).
======= Strict Matching ======= (exact matches)
===== Field-level results =====
label accuracy precision recall f1
figure_title 96.55 28.3 23.14 25.46
reference_citation 57.14 55.99 52.72 54.31
reference_figure 94.58 61.04 61.1 61.07
reference_table 99.08 82.87 82.21 82.54
section_title 94.47 74.91 66.88 70.67
table_title 97.44 7.91 8.22 8.06
all fields 89.88 58.19 54.69 56.38 (micro average)
89.88 51.84 49.05 50.35 (macro average)
======== Soft Matching ======== (ignoring punctuation, case and space characters mismatches)
===== Field-level results =====
label accuracy precision recall f1
figure_title 98.4 74.09 60.57 66.65
reference_citation 59.53 60.14 56.63 58.33
reference_figure 94.53 62.06 62.12 62.09
reference_table 99.08 83.39 82.73 83.06
section_title 95.08 79.14 70.65 74.65
table_title 97.57 15.51 16.13 15.81
all fields 90.7 63.24 59.44 61.28 (micro average)
90.7 62.39 58.14 60.1 (macro average)
====================================================================================
@kermitt2 do you see any differences (hopefully a little improvement) compared with the previous e2e measures?
There are differences; in particular I see a loss in citation metadata and an improvement on the abstract. However, the only way to be sure is to run it on the same architecture with and without the fixes (in case you have a branch). It also depends on whether you use consolidation or not.
I've checked based on the first comment, and with 0.5.6 the hyphens are safe :-)
I also checked the comment from @borkdude, and we improve the results on kent.pdf. @borkdude, could you have a look, especially if you have other cases?