
Transparent Strings

Open bmorton1 opened this issue 6 years ago • 3 comments

In some of the PDFs I have analyzed, there is transparent text. Has there been any thought of adding a transparency channel (#ARGB) so that any rendering from the XML text is accurate as far as transparent strings are concerned? I know this is probably a pretty specific ask, but I was just curious.

bmorton1 avatar Nov 26 '19 20:11 bmorton1

Could you provide some samples, just to make sure I understand what you mean by transparent strings?

Aazhar avatar Nov 29 '19 18:11 Aazhar

I did a little more research and these "hidden" strings may just be the result of a layered PDF? Does PDFALTO extract optional content groups or indicate them in any way?

bmorton1 avatar Dec 02 '19 17:12 bmorton1

Hello @bmorton1 !

Good question.

Currently in pdfalto we use the RGB format for color, so there is no alpha channel for transparency (as in ARGB), following the ALTO specification:

<xsd:attribute name="FONTCOLOR" type="xsd:hexBinary" use="optional">
  <xsd:annotation>
    <xsd:documentation>Font color as RGB value</xsd:documentation>
  </xsd:annotation>
</xsd:attribute>

As ALTO is designed for OCR, invisible text is not so relevant here :)

Usually the "invisible" text is used as a watermark, and it's white text on a white background or a very small font size that is invisible when rendered at pixel level. An alpha channel can be used for graphics elements in PDF I think, and I guess for fonts too. If you have interesting examples of PDFs with fonts that have an alpha level specified, it would be great if you could provide them.

xpdf has an attribute isInvisible for tokens, but I didn't find how it is set.

For pdfalto, it is a text stream as usual and it will be added to the output, associated with the "invisible" font defined in <TextStyle>, in particular with its @FONTCOLOR, which might be white. As mentioned above, @FONTCOLOR is in RRGGBB format, so it has no alpha channel. Using ARGB would require extending the ALTO format with an additional font color attribute, and it would also require xpdf to capture the alpha channel upstream, which I didn't find.
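As a rough illustration of the point above, here is a minimal Python sketch that lists the strings whose <TextStyle> has a white @FONTCOLOR. It is only a heuristic under stated assumptions: the sample document is made up, the namespace version is a guess, and it assumes each <String> points to its font through the STYLEREFS attribute defined by the ALTO schema. It cannot detect real transparency or optional content groups, since that information is not in the output.

```python
import xml.etree.ElementTree as ET

# Made-up sample for illustration only; real pdfalto output may differ
# in namespace version and in which attributes are present.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Styles>
    <TextStyle ID="font0" FONTFAMILY="times" FONTSIZE="10" FONTCOLOR="000000"/>
    <TextStyle ID="font1" FONTFAMILY="times" FONTSIZE="10" FONTCOLOR="FFFFFF"/>
  </Styles>
  <Layout>
    <Page ID="Page1">
      <PrintSpace>
        <TextBlock ID="b0">
          <TextLine ID="l0">
            <String ID="s0" CONTENT="Visible" STYLEREFS="font0"/>
            <String ID="s1" CONTENT="hidden" STYLEREFS="font1"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix so the code works for any ALTO version."""
    return tag.rsplit("}", 1)[-1]


def parse_rgb(value):
    """Turn an RRGGBB hex string into an (r, g, b) tuple of integers."""
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def strings_with_color(alto_xml, rgb=(255, 255, 255)):
    """Return the CONTENT of every String whose style has the given font color."""
    root = ET.fromstring(alto_xml)

    # Collect the IDs of the text styles that use the requested color.
    matching_styles = set()
    for element in root.iter():
        if local_name(element.tag) == "TextStyle":
            color = element.get("FONTCOLOR")
            if color and len(color) == 6 and parse_rgb(color) == rgb:
                matching_styles.add(element.get("ID"))

    # STYLEREFS is a space-separated list of style IDs (assumed to be on String).
    hits = []
    for element in root.iter():
        if local_name(element.tag) == "String":
            refs = (element.get("STYLEREFS") or "").split()
            if matching_styles.intersection(refs):
                hits.append(element.get("CONTENT"))
    return hits


if __name__ == "__main__":
    print(strings_with_color(SAMPLE))
```

This only catches the white-text case; text that is hidden through a tiny font size would need a separate check on @FONTSIZE.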

kermitt2 avatar Dec 02 '19 21:12 kermitt2