tabula-java
Feature request: extract background color of row
https://github.com/chezou/tabula-py/issues/131 has sent me here.
Hi @mvanaltvorst,
This is a neat idea that could certainly be helpful in some circumstances. It's unlikely that the core maintainers will make it a top priority. However, if you or another interested user wants to implement it and submit a pull request, we'd likely accept it.
cross-referencing with #21
This would indeed be very helpful to do post-processing of the extracted data (such as discarding certain colored rows).
Here's an example: https://usa.visa.com/content/dam/VCOM/download/merchants/visa-merchant-data-standards-manual.pdf (see pages around 50, where the data in white-bg is a different type of data than the grey-bg one).
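To illustrate the kind of post-processing meant here: tabula-java does not currently expose a row's background color, so the `ColoredRow` type and its `backgroundRgb` field below are hypothetical, and the grey thresholds are arbitrary assumptions. If an extractor did supply such a value, discarding the grey-background rows could be sketched roughly like this:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RowColorFilter {

    // Hypothetical container: not part of tabula-java's API.
    static final class ColoredRow {
        final List<String> cells;
        final int backgroundRgb; // 0xRRGGBB, assumed to be supplied by the extractor

        ColoredRow(List<String> cells, int backgroundRgb) {
            this.cells = cells;
            this.backgroundRgb = backgroundRgb;
        }
    }

    // Treat a color as grey when its channels are nearly equal
    // and it is not close to white. Thresholds are illustrative only.
    static boolean isGrey(int rgb) {
        int r = (rgb >> 16) & 0xFF;
        int g = (rgb >> 8) & 0xFF;
        int b = rgb & 0xFF;
        int spread = Math.max(r, Math.max(g, b)) - Math.min(r, Math.min(g, b));
        return spread <= 8 && r < 0xF0;
    }

    // Keep only the rows whose background is not grey.
    static List<ColoredRow> dropGreyRows(List<ColoredRow> rows) {
        List<ColoredRow> kept = new ArrayList<>();
        for (ColoredRow row : rows) {
            if (!isGrey(row.backgroundRgb)) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<ColoredRow> rows = Arrays.asList(
            new ColoredRow(Arrays.asList("Field", "Value"), 0xFFFFFF),
            new ColoredRow(Arrays.asList("Note", "See appendix"), 0xD9D9D9));
        for (ColoredRow row : dropGreyRows(rows)) {
            System.out.println(String.join(" | ", row.cells)); // prints "Field | Value"
        }
    }
}
```

With a real document the thresholds would need tuning, since "grey" in one PDF may be a light tint in another.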
Commenting to confirm that both background color and text color are quite frequently used to carry information in official documents from governments across Europe, and a feature like this would indeed be useful.
Here's an example of a Swedish official document using both at the same time (text color for gender, background color for nationality):
@jeremybmerrill Can you give any pointers on where to start reading if I want to attempt adding support for this? I have never used PdfBox, nor have I worked with PDFs, but it seems to me that colour is not available as a `text` or `textPosition` property, but rather through something called a `PDOutlineItem` (?). I can't find a way to get there from the `textPosition` thing, though...