pdfalto Blank image overlay

Open de-code opened this issue 2 years ago • 2 comments

For the example 471433v1 (from bioRxiv 10k training dataset), there is an image that is extracted with a blank overlay, with the same coordinates as the true image figure.

In a PDF viewer the document is rendered fine.

PDF

PDFAlto XML (extract)

<Illustration ID="p13_i1" HPOS="73.2000" VPOS="484.545" WIDTH="438.250" HEIGHT="134.640" ROTATION="0.000000" FILEID="471433v1.lxml_data/image-3.png" TYPE="image"/>
<Illustration ID="p13_i2" HPOS="73.2000" VPOS="484.545" WIDTH="438.250" HEIGHT="134.640" ROTATION="0.000000" FILEID="471433v1.lxml_data/image-4.png" TYPE="image"/>
<Illustration ID="p13_s5902" HPOS="0.2800" VPOS="72.0000" WIDTH="595.200" HEIGHT="554.440" ROTATION="0.000000" FILEID="471433v1.lxml_data/image-13.svg" TYPE="svg"/>

In this case image-3.png is the true figure image, whereas image-4.png appears to be blank / white (not transparent).

I am not sure how to interpret the order. Given the order, I would think image-4.png is drawn on top of image-3.png, and maybe it is missing transparency? Alternatively, is the order meant to be the opposite?
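To make the symptom easier to spot in other documents, here is a minimal sketch that groups Illustration elements by their bounding box and reports any that share identical coordinates. It only uses the attributes visible in the extract above; the `<Page>` wrapper around the sample and the function names are assumptions made for illustration, not part of pdfalto.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Sample built from the extract above; the <Page> wrapper is only
# there so the fragment parses as a single XML document.
SAMPLE = """<Page>
<Illustration ID="p13_i1" HPOS="73.2000" VPOS="484.545" WIDTH="438.250" HEIGHT="134.640" ROTATION="0.000000" FILEID="471433v1.lxml_data/image-3.png" TYPE="image"/>
<Illustration ID="p13_i2" HPOS="73.2000" VPOS="484.545" WIDTH="438.250" HEIGHT="134.640" ROTATION="0.000000" FILEID="471433v1.lxml_data/image-4.png" TYPE="image"/>
<Illustration ID="p13_s5902" HPOS="0.2800" VPOS="72.0000" WIDTH="595.200" HEIGHT="554.440" ROTATION="0.000000" FILEID="471433v1.lxml_data/image-13.svg" TYPE="svg"/>
</Page>"""


def local_name(tag):
    """Drop an XML namespace prefix such as '{uri}Illustration'."""
    return tag.rsplit("}", 1)[-1]


def find_stacked_illustrations(root):
    """Return lists of FILEIDs (in document order) that share one bounding box."""
    groups = defaultdict(list)
    for elem in root.iter():
        if local_name(elem.tag) != "Illustration":
            continue
        box = tuple(float(elem.get(key)) for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        groups[box].append(elem.get("FILEID"))
    # Keep only boxes occupied by more than one illustration.
    return [files for files in groups.values() if len(files) > 1]


stacked = find_stacked_illustrations(ET.fromstring(SAMPLE))
print(stacked)
```

For the extract above, this reports image-3.png and image-4.png as one stacked pair. It does not say which of the two is the blank one; that still requires inspecting the image files.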

Sep 18 '21 14:09 de-code

Hi Daniel,

For the white image, it's probably the same as what I raised here: https://github.com/kermitt2/grobid/issues/826

I think these are the "Soft-Mask" images of the PDF specifications (11.6.5.3 Soft-Mask Images, page 347).

Currently they are treated as ordinary images, but they are typed/distinguished in xpdf via the dictionary, and there are distinct ImageOutputDev methods for them. So we could probably mark these images with an attribute in the ALTO file or in the file name by defining some pattern, and/or add a command-line parameter to output them or not. Do you have a preference?

Dec 03 '21 11:12 kermitt2

Hi Patrice,

Thank you for getting back on it and explaining the issue.

Personally I would prefer the first option you mentioned: adding markup / an attribute to the ALTO XML output (not sure if there is already something appropriate in the schema).

Then reflecting it in the filename or adding a command line argument could be an optional extra. But it should be easy to post process based on the XML, depending on the use case. (Who knows, the masking image could be useful)
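As a rough sketch of what such post-processing could look like: the attribute name `SOFTMASK` and its value `"true"` below are purely hypothetical placeholders, since no attribute has been chosen or implemented yet, and the `<Page>` wrapper and function name are likewise assumptions.

```python
import xml.etree.ElementTree as ET

# HYPOTHETICAL attribute name: nothing like this exists in pdfalto output today.
MASK_ATTR = "SOFTMASK"

# Shortened sample; a real ALTO file is namespaced, so the tag lookup
# below would need the namespace-qualified name there.
SAMPLE = """<Page>
<Illustration ID="p13_i1" FILEID="471433v1.lxml_data/image-3.png" TYPE="image"/>
<Illustration ID="p13_i2" FILEID="471433v1.lxml_data/image-4.png" TYPE="image" SOFTMASK="true"/>
</Page>"""


def split_figures_and_masks(root, mask_attr=MASK_ATTR):
    """Separate illustration FILEIDs into (figures, masks) using the marker attribute."""
    figures, masks = [], []
    for elem in root.iter("Illustration"):
        target = masks if elem.get(mask_attr) == "true" else figures
        target.append(elem.get("FILEID"))
    return figures, masks


figures, masks = split_figures_and_masks(ET.fromstring(SAMPLE))
print("figures:", figures)
print("masks:", masks)
```

Keeping the masks in a separate list rather than discarding them leaves the choice to the caller, in case the masking image turns out to be useful.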

Dec 03 '21 12:12 de-code
