Missing column header content

Open · EnricoRausch opened this issue 1 year ago · 1 comment

I tried to extract a table from an image-based PDF file and ran into an anomaly: one of its column headers, "Descrição", was missing. It is a fairly specific case, since only this one PDF file came out with an incomplete header, but it may be something interesting to understand and incorporate into the code.

In my example, the header spans multiple rows because its cells contain long texts that wrap onto several lines. It is not actually a table with multiple header rows, but it is recognized as one because of these line breaks.

To understand where this issue comes from, I extracted the same PDF as an image instead. That worked, and no content was missing. The OCR also extracts the text properly in both cases.
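For anyone who wants to reproduce the comparison, here is a rough sketch of how the two extractions could be diffed. The file paths, the helper names (`cell_texts`, `lost_texts`, `compare_extractions`), and the choice of Tesseract with `lang="por"` are my own assumptions for illustration, not something taken from the library or from my actual script. It assumes img2table and Tesseract are installed.

```python
def cell_texts(df):
    """Collect every non-empty cell string from a table DataFrame."""
    values = df.fillna("").to_numpy().ravel()
    return {str(v).strip() for v in values if str(v).strip()}


def lost_texts(reference, candidate):
    """Texts present in the reference extraction but absent from the candidate."""
    return set(reference) - set(candidate)


def compare_extractions(pdf_path, image_path, lang="por"):
    """Extract tables from a PDF and from an image of the same page, then diff them.

    Both paths are hypothetical; the image is assumed to be a render of the PDF page.
    """
    # Imported lazily so the helpers above work without img2table installed.
    from img2table.document import PDF, Image
    from img2table.ocr import TesseractOCR

    ocr = TesseractOCR(lang=lang)
    pdf_tables = PDF(pdf_path).extract_tables(ocr=ocr)      # {page index: [tables]}
    img_tables = Image(image_path).extract_tables(ocr=ocr)  # [tables]

    pdf_set = set().union(*(cell_texts(t.df) for ts in pdf_tables.values() for t in ts))
    img_set = set().union(*(cell_texts(t.df) for t in img_tables))
    # Anything returned here was found in the image run but lost in the PDF run.
    return lost_texts(img_set, pdf_set)
```

Calling `compare_extractions("teste_desc_zuada_multilinhas.pdf", "page.png")` (the PNG name is made up) should list the cell texts that only the image run recovered. Note that the diff compares whole cell strings, so a cell whose text merely changed will also show up.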

Since the header was broken into many rows, my understanding is that img2table is losing text that sits "between rows", because the text is centered.
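To make this hypothesis concrete, here is a toy sketch of how a centered word could fall between rows. This is not img2table's actual code; the coordinates, the function names, and the assumption that words are assigned to cells by full containment are all made up for illustration.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def contains(cell: Box, word: Box) -> bool:
    """True if the word box lies entirely inside the cell box."""
    return (cell[0] <= word[0] and cell[1] <= word[1]
            and word[2] <= cell[2] and word[3] <= cell[3])


def overlap_area(a: Box, b: Box) -> int:
    """Area of the intersection of two boxes (0 if they do not intersect)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)


def assign_strict(cells: List[Box], word: Box) -> Optional[int]:
    """Assign the word to the first cell that fully contains it, else None."""
    for index, cell in enumerate(cells):
        if contains(cell, word):
            return index
    return None


def assign_by_overlap(cells: List[Box], word: Box) -> Optional[int]:
    """Assign the word to the cell it overlaps most, else None."""
    areas = [overlap_area(cell, word) for cell in cells]
    best = max(range(len(cells)), key=lambda i: areas[i])
    return best if areas[best] > 0 else None


# One header column wrongly split into three rows of 20 px each (hypothetical).
cells = [(100, 0, 200, 20), (100, 20, 200, 40), (100, 40, 200, 60)]
# A single-line header word, vertically centered, straddling the first row boundary.
word = (120, 14, 180, 34)

strict_result = assign_strict(cells, word)       # None: the word is dropped
overlap_result = assign_by_overlap(cells, word)  # 1: the word lands in the middle row
```

With strict containment the word belongs to no cell and disappears, while an overlap-based rule keeps it. Whether something like this is what really happens inside the library is only my guess.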

The missing column header is "Descrição" and can be seen in the attached image.

PS: The file is in Portuguese, but I don't think the language has anything to do with the issue, since the OCR, which does the text extraction, works correctly.

teste_desc_zuada_multilinhas.pdf image

Please let me know if you have any questions. It would be great to get feedback on this (possible) issue.

EnricoRausch, Jan 19 '24 21:01

Hello, I took a look at it, and this is due to the poor quality of the table header, which messes up the table detection. As of now, I have not found a way to fix it without degrading the overall performance of the library, so I won't be able to address it.

xavctn, Feb 11 '24 20:02