grobid Sub/superscript are displayed as plain text characters in the TEI output

Open MedKhem opened this issue 7 years ago • 5 comments

First thought: identify pieces of text as sub/superscript based on position, fonts, etc.

Feb 13 '17 15:02 MedKhem

Hey there, I had a quick question. I just started tinkering with grobid and I was wondering if superscript/subscript identification could be added through training, for example by providing the following training data:

β-cell Endoplasmic Reticulum Ca²⁺

<titleStmt>
  <title level="a" type="main">β-cell Endoplasmic Reticulum Ca<sup>2+</sup></title>
</titleStmt>

Thanks for the input.

Apr 20 '17 15:04 benjaminkreen

Subscript and superscript flags are attached to the tokens, so yes, we could serialize them with <sup> and <sub> elements. Similarly, we could add <hi> for bold and italic tokens.
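
To make the idea concrete, here is a minimal sketch of flag-driven serialization. It is written in Python for brevity and is not GROBID code: the Token class, its superscript/subscript fields, and the serialize function are hypothetical names used only for illustration. Consecutive tokens carrying the same flag are merged into one element.

```python
from dataclasses import dataclass
from itertools import groupby
from xml.sax.saxutils import escape


@dataclass
class Token:
    # Hypothetical stand-in for a layout token carrying style flags.
    text: str
    superscript: bool = False
    subscript: bool = False


def tag_for(token):
    """Return the inline element name for a token, or None for plain text."""
    if token.superscript:
        return "sup"
    if token.subscript:
        return "sub"
    return None


def serialize(tokens):
    """Merge consecutive tokens sharing the same flag into one element."""
    parts = []
    for tag, run in groupby(tokens, key=tag_for):
        text = escape("".join(t.text for t in run))
        parts.append(f"<{tag}>{text}</{tag}>" if tag else text)
    return "".join(parts)


title = [
    Token("β-cell Endoplasmic Reticulum Ca"),
    Token("2", superscript=True),
    Token("+", superscript=True),
]
print(serialize(title))  # β-cell Endoplasmic Reticulum Ca<sup>2+</sup>
```

Grouping runs before emitting tags avoids output such as <sup>2</sup><sup>+</sup> when adjacent tokens share a style.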

Aug 13 '20 18:08 kermitt2

I'm starting to work on implementing this feature.

What should be done when a token combines several styles, like italic + bold, or italic + bold + superscript?

Also, it seems that the place to add this would be TEIFormatter.java, which is quite big already. In particular, I wish I could avoid having to modify the method segmentIntoSentences, but it seems quite hard not to...

@kermitt2 any advice on this?

Jul 20 '22 06:07 lfoppiano

With the current recognition, the "style" features could indeed support, at least in principle, italic, bold, and superscript/subscript.

The TEI guidelines introduce <hi> to encode "graphically distinct" text and there is no constraint on the values, see here. We often see space-separated values (for example <hi rend="bold italic superscript">).

In TEI, there is also the <rendition> element, which uses CSS. It might be more predictable and would not require further customization of the XML schema.
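
As an illustration of the space-separated convention, the sketch below builds a single <hi> element from a set of styles. It is a Python sketch under assumptions, not GROBID code: the style names, the fixed ordering in STYLE_ORDER, and the function names are hypothetical choices made for the example.

```python
from xml.sax.saxutils import escape, quoteattr

# Assumed canonical order, so the same combination always yields the same value.
STYLE_ORDER = ("bold", "italic", "superscript", "subscript")


def rend_value(styles):
    """Join the active styles into one space-separated rend value."""
    return " ".join(s for s in STYLE_ORDER if s in styles)


def wrap_hi(text, styles):
    """Wrap text in a single <hi> element, or return it escaped if unstyled."""
    rend = rend_value(styles)
    if not rend:
        return escape(text)
    return f"<hi rend={quoteattr(rend)}>{escape(text)}</hi>"


print(wrap_hi("in vivo", {"italic"}))
print(wrap_hi("x", {"superscript", "italic", "bold"}))
```

Emitting one element with a combined value, rather than nesting one <hi> per style, keeps the inline markup flat; the fixed ordering is only there to make the output deterministic.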

I think what's complicated are the relations and the possible clash with other structures/tagging.

When this style information should be ignored: for instance, a reference marker is often in bold or a superscript number. But the logical "reference" structure is already captured by the <ref> labeling, and there is no point in keeping rendering information here. The <hi> is only relevant to text without any other explicit logical mark-up. The only exception, I think, would be superscript/subscript inside a formula.

To maintain hierarchical structures: the <hi> element would be an inline annotation (like <ref>), so it is always under structure tags like <p> or <s>. This is indeed complicated for sentence segmentation, because the sentence segments are introduced after the initial serialization, directly on the TEI objects (this was to simplify the serialization: working with a tree structure ensures that we have well-formed XML at the end). The <hi> tags should be manageable like the <ref> tags in segmentIntoSentences, except when the bold/italic/etc. covers more than the sentence. If a highlight style to be labeled covers more than the current sentence, we would need to close it at the end of the sentence and re-open it in the next sentence.
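
The two constraints above (drop styling inside references, close and re-open a highlight at sentence boundaries) can be sketched with character offsets. This is a hypothetical Python illustration, not the logic of segmentIntoSentences: the function name, the tuple layout, and the half-open (start, end) offset convention are all assumptions.

```python
def split_highlight(span, sentences, refs=()):
    """Cut one styled span into pieces that each stay inside a single
    sentence and outside every reference range.

    span:      (start, end, rend) with half-open character offsets
    sentences: list of (start, end) sentence ranges
    refs:      list of (start, end) ranges already tagged as <ref>
    """
    start, end, rend = span
    pieces = []
    for s_start, s_end in sentences:
        # Clip the span to the current sentence.
        lo, hi = max(start, s_start), min(end, s_end)
        if lo >= hi:
            continue
        cursor = lo
        for r_start, r_end in sorted(refs):
            if r_end <= cursor or r_start >= hi:
                continue  # this reference does not overlap the clipped span
            if r_start > cursor:
                pieces.append((cursor, r_start, rend))
            cursor = max(cursor, r_end)  # skip over the reference
        if cursor < hi:
            pieces.append((cursor, hi, rend))
    return pieces


sentences = [(0, 10), (11, 25)]
print(split_highlight((5, 18, "italic"), sentences))
print(split_highlight((5, 18, "italic"), sentences, refs=[(13, 15)]))
```

Each returned piece can then be serialized as its own <hi> element, which keeps every sentence element well-formed on its own.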

Jul 20 '22 10:07 kermitt2

I think it's now implemented by injecting <hi rend="bold italic">. The flow of decoration is interrupted by references (that was easy) and sentences (that was a pain).

I've also tried to modularise the code a bit into methods, so that they can be unit tested as separate components.

I tried not to run the automatic realignment of the code 😅, which usually makes a mess...

I'm sending some examples: Examples.zip

Jul 28 '22 02:07 lfoppiano
