grobid
Dealing with the Invisible :)
A relatively high number of PDFs contain hidden content (usually white on white), which affects grobid processing more or less severely.
Two cases I want to highlight here :)
- the white SVG masks (white shapes covering a white background):
https://academic.oup.com/pcp/article/44/10/1055/1868002?login=true
These cases tend to create figures in the middle of nowhere, because the hidden graphic elements sit close to each other and get grouped as if they were a real figure. The difficulty with SVG is that all the SVG graphics of a page end up in one single SVG document, so invisibility has to be handled after clustering the SVG elements of the page, by looking at the style information associated with the groups.
- the repeated hidden text:
mbrane association and
α
syn multimerization. Conversely, ATP13A2 WT plays a
protective role against
α
syn multimerization by maintaining the integrity of the lysosomal
membrane and by inhibiting
α
syn membrane association and multimerization. We also
found that it regulates the ubiquitin-prot
easome system (UPS) and nanovesicle-based
external secretion to r
see https://github.com/DataSeer/dataseer-web/issues/475
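For the SVG case, a check on the style information could look roughly like the sketch below. It is only an illustration in Python, not grobid code: the function names, the list of shape tags and the set of "white" values are all assumptions, and it assumes the page background is white. Style is simply propagated from the groups down to the shapes, which ignores some SVG subtleties (for example a group opacity cannot really be overridden by a child).

```python
import xml.etree.ElementTree as ET

# Assumption: the page background is white, so white-filled shapes are invisible.
WHITE = {"white", "#fff", "#ffffff", "rgb(255,255,255)"}
SHAPES = {"rect", "path", "circle", "ellipse", "polygon", "polyline", "line"}
STYLE_KEYS = ("fill", "stroke", "opacity", "fill-opacity", "display", "visibility")


def parse_style(elem):
    """Merge presentation attributes and the inline style attribute into a dict."""
    props = {k: elem.attrib[k] for k in STYLE_KEYS if k in elem.attrib}
    for decl in elem.attrib.get("style", "").split(";"):
        if ":" in decl:
            name, value = decl.split(":", 1)
            props[name.strip()] = value.strip()
    return props


def is_invisible(props):
    """Heuristic: hidden, fully transparent, or white fill without a visible stroke."""
    if props.get("display") == "none" or props.get("visibility") == "hidden":
        return True
    for key in ("opacity", "fill-opacity"):
        try:
            if float(props.get(key, "1")) == 0.0:
                return True
        except ValueError:
            pass  # unparsable value: do not decide on it
    fill = props.get("fill", "black").replace(" ", "").lower()
    stroke = props.get("stroke", "none").replace(" ", "").lower()
    return fill in WHITE and (stroke == "none" or stroke in WHITE)


def invisible_elements(svg_text):
    """Return the id (or tag name) of every shape judged invisible."""
    hidden = []

    def walk(elem, inherited):
        # Style set on a group is passed down to its children.
        props = {**inherited, **parse_style(elem)}
        tag = elem.tag.split("}")[-1]  # drop the XML namespace prefix
        if tag in SHAPES and is_invisible(props):
            hidden.append(elem.attrib.get("id", tag))
        for child in elem:
            walk(child, props)

    walk(ET.fromstring(svg_text), {})
    return hidden
```

Shapes flagged this way could be dropped from a cluster before deciding whether the cluster is a figure, so a group made only of white masks would no longer produce one.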
We should consider introducing a systematic way to detect and neutralize these cases.
By the way, apart from deliberately harming document mining, I have no clue what the goal of these spurious white elements actually is: they either repeat existing text or cover white with more white, so in both cases they do nothing apparently meaningful.
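For the repeated hidden text, one possible starting point is sketched below, again as a Python illustration rather than grobid code. The `Token` class, its fields and `drop_hidden_tokens` are hypothetical. The sketch assumes each token comes with a position and a fill colour, and that the hidden copies are either white or sit at nearly the same position as the visible text, which would need to be checked on real documents.

```python
from dataclasses import dataclass

WHITE = {"white", "#fff", "#ffffff"}


@dataclass(frozen=True)
class Token:
    """Hypothetical token: text, top-left position on the page, fill colour."""
    text: str
    x: float
    y: float
    color: str = "#000000"


def drop_hidden_tokens(tokens, tolerance=1.0):
    """Keep the first visible occurrence of each token, drop white and duplicated ones.

    Positions are bucketed on a grid of size `tolerance`, so two copies that
    fall on either side of a bucket boundary will not be matched; this is a
    rough heuristic, not an exact overlap test.
    """
    kept = []
    seen = set()
    for tok in tokens:
        if tok.color.lower() in WHITE:
            continue  # white text, assumed invisible on a white page
        key = (tok.text, round(tok.x / tolerance), round(tok.y / tolerance))
        if key in seen:
            continue  # same text at (almost) the same place: repeated copy
        seen.add(key)
        kept.append(tok)
    return kept


if __name__ == "__main__":
    page = [
        Token("syn", 10.0, 20.0),
        Token("syn", 10.2, 20.1),               # repeated copy
        Token("hidden", 50.0, 60.0, "#FFFFFF"),  # white on white
        Token("membrane", 30.0, 20.0),
    ]
    print([t.text for t in drop_hidden_tokens(page)])
```

Filtering at the token level, before segmentation, would keep the duplicated fragments out of the extracted body text.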