grobid
Sub/superscript are displayed as plain text characters in the TEI output
First thought: identify pieces of text as sub/superscript based on position, fonts, etc.
Hey there, I had a quick question. I just started tinkering with grobid and I was wondering if the superscript/subscript identification can be added through training such as giving the following training data:
β-cell Endoplasmic Reticulum Ca2+
<titleStmt>
<title level="a" type="main">β-cell Endoplasmic Reticulum Ca<sup>2+</sup></title>
</titleStmt>
thanks for the input
Subscript and superscript flags are attached to the tokens, so we could serialize them with <sup> and <sub> elements, yes.
Similarly, we could add <hi> for bold and italic tokens.
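To make the idea concrete, here is a minimal sketch in Python (not GROBID's actual Java code) of serializing flagged tokens. The `Token` class, its `superscript`/`subscript` fields, and the `serialize` function are hypothetical names chosen for illustration; the sketch also assumes whitespace is already part of the token text.

```python
from dataclasses import dataclass
from itertools import groupby
from xml.sax.saxutils import escape


@dataclass
class Token:
    """Hypothetical stand-in for a layout token carrying style flags."""
    text: str
    superscript: bool = False
    subscript: bool = False


def tag_for(token):
    """Return the inline element name for a token, or None for plain text."""
    if token.superscript:
        return "sup"
    if token.subscript:
        return "sub"
    return None


def serialize(tokens):
    """Merge consecutive tokens with the same style into a single element."""
    parts = []
    for tag, run in groupby(tokens, key=tag_for):
        text = escape("".join(t.text for t in run))
        parts.append(f"<{tag}>{text}</{tag}>" if tag else text)
    return "".join(parts)


title = [Token("β-cell Endoplasmic Reticulum Ca"), Token("2+", superscript=True)]
print(serialize(title))  # β-cell Endoplasmic Reticulum Ca<sup>2+</sup>
```

Grouping consecutive tokens avoids emitting one element per token when a superscript spans several tokens.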
I'm starting to work on implementing this feature.
What should be done when the token contains combinations? Like italic + bold, or italic + bold + superscript?
Also, it seems that the place to add this part would be TEIFormatter.java, which is already quite big. In particular, I wish I could avoid having to modify the method segmentIntoSentences, but that seems quite hard to avoid...
@kermitt2 any advice on this?
With the current recognition, the "style" features could indeed support, in principle, at least italic, bold, and superscript/subscript.
The TEI guidelines introduce <hi> to encode "graphically distinct" text and there is no constraint on the values, see here. We often see space-separated values (for example <hi rend="bold italic superscript">).
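As an illustration of the space-separated convention, a small Python sketch of building the rend value from a set of style names could look like the following. The `REND_ORDER` tuple, the fixed ordering, and the function names are assumptions made for this example, not something prescribed by TEI or taken from GROBID.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumed canonical order, so the same combination always yields the same string.
REND_ORDER = ("bold", "italic", "superscript", "subscript")


def rend_value(styles):
    """Join the active style names into one space-separated rend value."""
    unknown = set(styles) - set(REND_ORDER)
    if unknown:
        raise ValueError(f"unsupported styles: {sorted(unknown)}")
    return " ".join(name for name in REND_ORDER if name in styles)


def make_hi(text, styles):
    """Wrap text in a <hi> element, or return escaped plain text if unstyled."""
    if not styles:
        return escape(text)
    hi = ET.Element("hi", {"rend": rend_value(styles)})
    hi.text = text
    return ET.tostring(hi, encoding="unicode")


print(make_hi("2+", {"superscript", "italic", "bold"}))
# <hi rend="bold italic superscript">2+</hi>
```

A fixed order keeps the output stable, which makes it easier to compare results across runs or in unit tests.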
In TEI, there's also the <rendition> element, which uses CSS; it might be more predictable and would not require further customization of the XML schema.
I think what's complicated are the relations and the possible clashes with other structures/tagging:
- When this style information should be ignored: for instance, a reference marker is often in bold or a superscript number. But the logical "reference" structure is already captured by the <ref> labeling, and there is no point in keeping rendering information there. The <hi> is only relevant to text without any other explicit logical mark-up. The only exception I think would be superscript/subscript inside a formula.
- To maintain hierarchical structures: the <hi> element would be an inline annotation (like <ref>), so it is always under structure tags like <p> or <s>. This is indeed complicated for sentence segmentation, because the sentence segments are introduced after the initial serialization, directly on the TEI objects (this was to simplify the serialization! working with a tree structure ensures that we have well-formed XML at the end). The <hi> tags should be manageable like the <ref> tags in segmentIntoSentences, except if the bold/italic/etc. covers more than the sentence. If a highlight style to be labeled covers more than the current sentence, we would need to close it at the end of the sentence and re-open it in the next sentence.
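The two constraints above can be sketched on plain character offsets. This is a simplified Python illustration, not the tree-based logic in segmentIntoSentences; the function name `split_highlight` and the use of half-open (start, end) offsets are assumptions made for the example.

```python
def split_highlight(highlight, sentences, refs=()):
    """Clip one highlight span to sentence boundaries and skip <ref> spans.

    All spans are half-open (start, end) character offsets. The result is
    the list of pieces that would each get their own <hi> element.
    """
    start, end = highlight
    pieces = []
    for s_start, s_end in sentences:
        # Close the highlight at the sentence end, re-open it in the next one.
        lo, hi = max(start, s_start), min(end, s_end)
        if lo >= hi:
            continue  # the highlight does not touch this sentence
        cursor = lo
        for r_start, r_end in sorted(refs):
            if r_end <= cursor or r_start >= hi:
                continue  # this reference does not overlap the remaining piece
            if r_start > cursor:
                pieces.append((cursor, r_start))
            # Style inside a reference is dropped: <ref> already marks it.
            cursor = max(cursor, r_end)
        if cursor < hi:
            pieces.append((cursor, hi))
    return pieces


sentences = [(0, 10), (11, 25)]
print(split_highlight((5, 20), sentences))              # [(5, 10), (11, 20)]
print(split_highlight((5, 20), sentences, [(13, 16)]))  # [(5, 10), (11, 13), (16, 20)]
```

Each returned piece lies entirely inside one sentence and outside every reference, so wrapping it in <hi> cannot break the <s> or <ref> nesting.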
I think it's now implemented by injecting <hi rend="bold italic">. The flow of decoration is interrupted by references (that was easy) and sentences (that was a pain).
I've also tried to modularise the code a bit into methods, so that they can be unit tested as separate components.
I tried not to run the realignment of the code 😅 which usually makes a mess...
I'm sending some examples: Examples.zip