Dealing with the Invisible :)

Open kermitt2 opened this issue 2 years ago • 0 comments

There is a relatively high number of PDFs with hidden content (usually white on white), which affects grobid processing more or less severely.

Two cases I want to make visible/highlight here :)

  • the SVG white masks (white shapes covering white):

Screenshot from 2021-08-27 08-00-59 Screenshot from 2021-08-27 08-00-39

https://academic.oup.com/pcp/article/44/10/1055/1868002?login=true

These cases tend to create figures in the middle of nowhere, because of the proximity of these hidden graphic elements. The problem with SVG is that all the SVG graphics of a page sit in one single SVG document, so we need to deal with invisibility after clustering the SVG elements of the page (by looking at the style information associated with the groups).
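To make the idea of "looking at the style information associated with the groups" concrete, here is a minimal sketch in Python. It is not grobid code: the function names and the set of colour spellings are assumptions for illustration. It only checks `fill` and `stroke` (as attributes or inside `style`), inherited from parent groups, and ignores opacity, clipping and CSS stylesheets.

```python
import xml.etree.ElementTree as ET

# Assumed spellings of "white"; a real detector would parse colours properly.
WHITE = {"white", "#fff", "#ffffff", "rgb(255,255,255)"}
SHAPES = {"path", "rect", "polygon", "circle", "ellipse"}


def local_name(tag):
    """Strip the XML namespace from a tag such as '{http://www.w3.org/2000/svg}rect'."""
    return tag.rsplit("}", 1)[-1]


def own_style(elem):
    """Collect fill/stroke set directly on one element (attributes, then inline style)."""
    props = {}
    for name in ("fill", "stroke"):
        if elem.get(name) is not None:
            props[name] = elem.get(name)
    for decl in (elem.get("style") or "").split(";"):
        if ":" in decl:
            name, value = decl.split(":", 1)
            if name.strip() in ("fill", "stroke"):
                props[name.strip()] = value.strip()
    return {k: v.lower().replace(" ", "") for k, v in props.items()}


def is_white_only(style):
    """True when the shape paints nothing but white (SVG defaults: black fill, no stroke)."""
    fill = style.get("fill", "black")
    stroke = style.get("stroke", "none")
    return fill in WHITE and (stroke in WHITE or stroke == "none")


def find_white_shapes(svg_text):
    """Return the ids of shapes whose effective fill and stroke are white only."""
    hits = []

    def walk(elem, inherited):
        style = {**inherited, **own_style(elem)}  # child values override the group's
        if local_name(elem.tag) in SHAPES and is_white_only(style):
            hits.append(elem.get("id"))
        for child in elem:
            walk(child, style)

    walk(ET.fromstring(svg_text), {})
    return hits


sample = """<svg xmlns="http://www.w3.org/2000/svg">
  <g style="fill:#FFFFFF">
    <rect id="mask" width="10" height="10"/>
    <rect id="bar" width="10" height="10" fill="black"/>
  </g>
  <path id="outline" d="M0 0 L1 1" fill="white" stroke="black"/>
</svg>"""
print(find_white_shapes(sample))
```

Shapes flagged this way could be dropped before the figure clustering step, or simply excluded when computing figure bounding boxes. Whether a white shape is really invisible also depends on what lies underneath it, which this sketch does not check.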

  • the repeated hidden text:

Screen Shot 2021-08-27 at 07 39 28

mbrane  association  and  
α
syn  multimerization.  Conversely,  ATP13A2  WT  plays  a  
protective role against 
α
syn multimerization by maintaining the integrity of the lysosomal 
membrane and by inhibiting 
α
syn membrane association and multimerization. We also 
found  that  it  regulates  the  ubiquitin-prot
easome  system  (UPS)  and  nanovesicle-based  
external secretion to r

see https://github.com/DataSeer/dataseer-web/issues/475
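For the repeated text case, one possible heuristic is to look for long token sequences that already occurred earlier on the same page. The sketch below is a hypothetical illustration in Python, not grobid code: `find_repeated_runs` and its `n` parameter are made-up names, and the input is assumed to be the page's tokens in extraction order.

```python
def find_repeated_runs(tokens, n=8):
    """Return sorted indices of tokens belonging to an n-gram already seen earlier.

    Only non-overlapping repeats count, so a run like "a a a a a" does not
    match itself one position later.
    """
    first_seen = {}   # n-gram -> index of its first occurrence
    repeated = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        start = first_seen.get(gram)
        if start is None:
            first_seen[gram] = i
        elif i - start >= n:
            repeated.update(range(i, i + n))
    return sorted(repeated)


visible = "the membrane association and multimerization was found".split()
page_tokens = visible + ["x"] + visible  # second copy simulates the hidden duplicate
print(find_repeated_runs(page_tokens, n=4))
```

This only locates candidates. It cannot tell which of the two copies is the hidden one, so it would need to be combined with colour or rendering information before anything is removed.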

We should consider the introduction of a systematic way to detect and neutralize these cases.

By the way, apart from deliberately harming document mining, I have no clue what the goal of these spurious white elements actually is, since they either repeat existing text or cover white with more white, so in both cases they do nothing apparently meaningful.

kermitt2 avatar Aug 27 '21 06:08 kermitt2