
Feature Request: Additional Styles

Open de-code opened this issue 5 years ago • 3 comments

Currently the following styles seem to be supported by pdfalto:

  • italic
  • bold
  • subscript
  • superscript

Some other styles we would be interested in:

  • underline
  • sc
  • monospace
  • strike
  • overline
  • roman (although not found in bioRxiv dataset)

I also have named-content on the list but I am not sure whether that really falls into the styles category.


Examples:

italic

462929v1 (10.1101/462929)

PDF:

image

bioRxiv XML:

would start with loss of the <italic>APC</italic> gene, followed by mutations in <italic>KRAS</italic> or <italic>BRAF</italic> genes, mutations or loss of <italic>TP53</italic> gene and of SMAD family member 4 (<italic>SMAD4</italic>) [<xref ref-type="bibr" rid="c4">4</xref>].</p>

GROBID 0.6.1 seems to miss large parts of the Introduction section.

Using one of the more recent bioRxiv trained models:

would start with loss of the APC gene, followed by mutations in KRAS or BRAF genes, mutations or loss of TP53 gene and of SMAD family member 4 (SMAD4) 

No italic in sight.

sc (Small Caps)

188706v1 (10.1101/188706)

PDF:

image

bioRxiv XML:

<p>Finally, <sc>ndmg</sc> produces a QA plot

473744v1 (10.1101/473744)

PDF:

image

bioRxiv XML, probably not correct:

<title>A<sc>bstract</sc></title>

underline

218479v1 (10.1101/218479)

PDF:

image

bioRxiv XML:

<contrib-group>
    <!-- .... -->
    <aff id="a1"><label>1</label><institution>Biology Department, University of Pennsylvania</institution>, Philadelphia, PA 19104</aff>
    <aff id="a2"><label>2</label><institution>Neuroscience Graduate Group, University of Pennsylvania</institution>, Philadelphia, PA 19104</aff>
    <aff id="a3"><label>3</label><institution>Biological Basis of Behavior Program, University of Pennsylvania</institution>, Philadelphia, PA 19104</aff>
    <aff id="a4"><label>4</label><underline>Current Address</underline>: <institution>The Lockwood Group</institution>, Stamford, CT 06901.</aff>
    <aff id="a5"><label>5</label><underline>Current Address</underline>: <institution>Department of Neuroscience, University of Pennsylvania</institution>, Philadelphia, PA 19104</aff>
</contrib-group>
<author-notes>
    <corresp id="cor1"><underline>Corresponding Author</underline>: <underline>Email</underline>: <email>[email protected]</email></corresp>
    <corresp id="cor2"><label>&#x002A;</label>Both authors contributed equally to this work</corresp>
</author-notes>

monospace

473744v1 (10.1101/473744)

PDF:

image

bioRxiv XML:

normalized using the <monospace>Normalize_Variant</monospace> function

strike

Only two of the 6000 training documents contain <strike>.
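
A count like this could be reproduced with something along the following lines. This is only a minimal sketch: it assumes each document is available as a well-formed XML string without unresolved DTD entities, and `STYLE_TAGS` and `count_documents_per_style` are hypothetical names, not part of pdfalto or GROBID.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Style elements discussed in this issue (assumed to be un-namespaced JATS tags).
STYLE_TAGS = ("italic", "bold", "sub", "sup", "underline", "sc",
              "monospace", "strike", "overline", "roman")


def count_documents_per_style(xml_documents):
    """Return, per style tag, the number of documents that contain it at least once."""
    counts = Counter()
    for xml_text in xml_documents:
        root = ET.fromstring(xml_text)
        # A set ensures each document is counted only once per tag.
        present = {el.tag for el in root.iter() if el.tag in STYLE_TAGS}
        counts.update(present)
    return counts
```

Calling `count_documents_per_style(docs)["strike"]` on the list of training XML strings would then give the number of documents using `<strike>`.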

701052v1 (10.1101/701052)

PDF:

image

bioRxiv XML:

using antifungals and absence of oral lesions. <strike>The data collection, the oral mucosa examination, were performed according to Pieralisi et al., 2016. This study was conducted according to the Resolution 466/2012 of the National Health Council and was previously approved by the Ethics Committee for the Research Involving Humans of the State University of Maring&#x00E1;, Brazil [COPEP-EMU n&#x00B0; 383979, CAEE resolution n&#x00B0; 17297713.2.0000.0104].</strike></p>

overline

(not many examples)

362053v1 (10.1101/362053)

PDF:

image

bioRxiv XML:

denotes the expectation and <overline><italic>t<sub>a</sub></italic></overline> and <overline><italic>t<sub>b</sub></italic></overline>

de-code, Oct 15 '20 18:10

Do you know which styles can be obtained from the full font name?

The way to get italic or bold information is actually very hacky (https://github.com/kermitt2/pdfalto/blob/master/src/XmlAltoOutputDev.cc#L598), but that is the usual way to do it.

In pdfalto, superscript and subscript are detected by analysing the token position relative to the baseline and the top of the line, together with changes in font size.
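
To illustrate the idea (this is not pdfalto's actual code, which is C++), a minimal sketch of a position and font-size heuristic could look as follows. The `Token` type, the function name and the two thresholds are assumptions made for the example; y coordinates are assumed to grow downwards.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    baseline_y: float  # y of the token baseline (page coordinates, growing downwards)
    font_size: float


def classify_script(token, line_baseline_y, line_font_size,
                    size_ratio=0.85, shift_ratio=0.15):
    """Classify a token as 'superscript', 'subscript' or 'normal'.

    The thresholds are illustrative guesses, not values taken from pdfalto.
    """
    smaller = token.font_size <= size_ratio * line_font_size
    # Positive shift: the token baseline sits above the line baseline.
    shift = line_baseline_y - token.baseline_y
    threshold = shift_ratio * line_font_size
    if smaller and shift > threshold:
        return "superscript"
    if smaller and shift < -threshold:
        return "subscript"
    return "normal"
```

Requiring both a smaller font and a vertical shift keeps footnote-sized text on its own line from being flagged, at the cost of missing scripts set in the same size as the surrounding text.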

kermitt2, Apr 07 '21 21:04

I haven't looked into it yet. I imagine it is not trivial. I believe I have seen detection based on the font name elsewhere, in some cases even via a font name map. As a first step I would probably also check what poppler or similar tools can already do.
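
A font name map of that kind might be sketched like this. The hint substrings and the function name are assumptions for illustration only; the one PDF-specific detail relied on is that subsetted fonts carry a six-letter prefix such as `ABCDEF+`. Styles like underline, strike and overline are usually drawn as separate lines rather than encoded in the font, so a name-based lookup would not find them.

```python
import re

# Hypothetical map from style to lowercase substrings hinting at it in a font name.
FONT_NAME_HINTS = {
    "italic": ("italic", "oblique"),
    "bold": ("bold",),
    # "mono" may give false positives on names such as "Monotype Corsiva".
    "monospace": ("courier", "mono"),
    "sc": ("smallcaps",),
}

# Subsetted PDF fonts are prefixed with six uppercase letters and a plus sign.
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def styles_from_font_name(font_name):
    """Guess a set of styles from a PDF font name using substring hints."""
    name = SUBSET_PREFIX.sub("", font_name).lower()
    return {
        style
        for style, hints in FONT_NAME_HINTS.items()
        if any(hint in name for hint in hints)
    }
```

For example, `styles_from_font_name("ABCDEF+TimesNewRomanPS-BoldItalicMT")` yields both `bold` and `italic`, while a plain `Helvetica` yields an empty set.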

I suppose if the font name is not sufficient, then perhaps analysing the glyph might do. Might end up with a machine learning model ;-)

de-code, Apr 07 '21 21:04

Thinking about it, I have some vague recollection someone at the GROBID camp in Paris was going to work on something related. Perhaps Pedro? I am probably just making it up.

de-code, Apr 07 '21 21:04