tabula-java

Feature request: extract background color of row

Open mvanaltvorst opened this issue 6 years ago • 6 comments

I was referred here from https://github.com/chezou/tabula-py/issues/131.

mvanaltvorst, Jan 19 '19 17:01

Hi @mvanaltvorst,

This is a neat idea that could certainly be helpful in some circumstances. The core maintainers are unlikely to make it a top priority. However, if you or another interested user wants to implement it and submit a pull request, we'd likely accept it.

Cross-referencing with #21.

jeremybmerrill, Jan 19 '19 18:01

This would indeed be very helpful for post-processing the extracted data (such as discarding rows of a certain color).
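For illustration, a minimal sketch of that kind of post-processing in Java, assuming the extractor could return each row together with its background color. tabula-java does not expose this today; `ColoredRow`, `backgroundRgb`, and `dropRowsWithBackground` are hypothetical names, not part of its API, and the sample rows are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: tabula-java does not currently report row background colors.
class RowColorFilter {

    // Hypothetical container pairing a row's cell texts with its background
    // color, packed as 0xRRGGBB.
    static final class ColoredRow {
        final List<String> cells;
        final int backgroundRgb;

        ColoredRow(List<String> cells, int backgroundRgb) {
            this.cells = cells;
            this.backgroundRgb = backgroundRgb;
        }
    }

    // Returns a new list holding only the rows whose background differs
    // from the unwanted color; the input list is left unchanged.
    static List<ColoredRow> dropRowsWithBackground(List<ColoredRow> rows, int unwantedRgb) {
        List<ColoredRow> kept = new ArrayList<>();
        for (ColoredRow row : rows) {
            if (row.backgroundRgb != unwantedRgb) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        int white = 0xFFFFFF;
        int grey = 0xD3D3D3;

        // Made-up rows standing in for extracted table data.
        List<ColoredRow> rows = Arrays.asList(
                new ColoredRow(Arrays.asList("A1", "B1"), white),
                new ColoredRow(Arrays.asList("A2", "B2"), grey),
                new ColoredRow(Arrays.asList("A3", "B3"), white));

        // Discard the grey rows and print what remains.
        for (ColoredRow row : dropRowsWithBackground(rows, grey)) {
            System.out.println(row.cells);
        }
    }
}
```

An exact color match is the simplest choice here; real documents may need a tolerance, since shades can vary slightly between rows.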

jleclanche, Apr 19 '20 23:04

Here's an example: https://usa.visa.com/content/dam/VCOM/download/merchants/visa-merchant-data-standards-manual.pdf (see the pages around 50, where the rows with a white background hold a different type of data than the rows with a grey background).

jleclanche, Apr 19 '20 23:04

Commenting to confirm that both background color and text color are quite frequently used to carry information in official documents from governments across Europe, and that a feature like this would indeed be useful.

Here's an example of a Swedish official document using both at the same time (text color for gender, background color for nationality): [image]

rotsee, Jun 11 '20 17:06

@jeremybmerrill Can you give me any pointers on where to start reading if I want to attempt adding support for this? I have never used PdfBox, nor have I worked with PDFs, but it seems to me that colour is not available as a text or textPosition property, but rather through something called a PDOutlineItem(?). I can't find a way to get there from the textPosition, though...

rotsee, Oct 26 '20 16:10