pdftotree Loss of information oftentimes in the last line of a table

Describe the bug I tried the plain pdftotree command-line utility on a few PDF files with tables and found that wherever there is a table structure, the last line is usually not captured in the output hOCR file.

Is that expected behavior, or does it have something to do with the extract_tables utility?

To Reproduce Steps to reproduce the behavior:

sample pdf downloaded from https://www.w3.org/WAI/WCAG20/Techniques/working-examples/PDF20/table.pdf
run pdftotree pdf/table.pdf -o hocr/table.hocr
check hOCR output

Expected behavior The last line of the table should be extracted in the output; currently it is missing.

Environment (please complete the following information):

OS: macOS 10.15.6
pdftotree Version: 0.5.0
pdfminer.six Version: 20200726

Additional context Same behaviors occurred on a few other files I used.

Nov 12 '20 18:11 linM24

This is not expected behaviour. In addition to the missing last row of the table, I can see some duplicated cells. However, this may not be pdftotree's bug, as it relies on tabula for table recognition. I'd appreciate it if you could try tabula-py directly on the same PDF.

Nov 13 '20 06:11 HiromuHota

Yea, that's also what I thought.

Will do! Thanks

Nov 16 '20 18:11 linM24

Sorry for the delay. It turns out tabula works fine on the PDFs I used. Although sometimes it may not be able to accurately convert a table structure into a dataframe or JSON, the pure text information is fully preserved. So I suspect there might be a minor problem in the pipeline of parsing the output of tabula-py.

Dec 11 '20 22:12 linM24

I looked into this issue and confirmed that it is a pdftotree bug in how it specifies the table area.

$ pdftotree table.pdf -o table.hocr -vv
[INFO] pdftotree.core - Digitized PDF detected, building tree structure...
[WARNING] pdftotree.utils.pdf.pdf_parsers - No boxes to get figures from on page 1.
[INFO] pdftotree.core - Tree structure built, creating html...
[DEBUG] pdftotree.TreeExtract - Calling tabula at page: 1 and area: (146.20799999999997, 90.0, 331.78175999999996, 539.4936).
[DEBUG] pdftotree.TreeExtract - Tabula recognized 1 table(s).
[INFO] pdftotree.core - HTML created.
hOCR output to table.hocr

As can be seen in the log message, pdftotree specified the table area as (146.20799999999997, 90.0, 331.78175999999996, 539.4936) (top, left, bottom, right). This is actually a few pixels smaller than the actual table.

Dec 13 '20 00:12 HiromuHota

I wonder where this pixel shift happens.

Dec 13 '20 00:12 HiromuHota

I think I figured out what was happening. When you run pdftotree without the -mt option, it detects tables heuristically. https://github.com/HazyResearch/pdftotree/blob/0686a1845c7901aa975544a9107fc10594523986/pdftotree/TreeExtract.py#L256-L259

The heuristic used here is that words are vertically aligned in a table. https://github.com/HazyResearch/pdftotree/blob/0686a1845c7901aa975544a9107fc10594523986/pdftotree/utils/pdf/pdf_parsers.py#L54-L66

So the table area detected by this heuristic, (146.20799999999997, 90.0, 331.78175999999996, 539.4936), is actually correct given how the table is detected: it covers all the words in the table. However, it does not include the table border lines.
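To illustrate the point, here is a minimal sketch (not pdftotree's actual code) of why an area derived only from word boxes can leave out the ruling lines. The helper names `union_area` and `contains`, and all coordinates, are made up for illustration; boxes use the same (top, left, bottom, right) order as the log message.

```python
from typing import Iterable, Tuple

# A box is (top, left, bottom, right), matching the order in the debug log.
Box = Tuple[float, float, float, float]


def union_area(boxes: Iterable[Box]) -> Box:
    """Return the smallest box that covers every box in `boxes`."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("need at least one box")
    return (
        min(b[0] for b in boxes),  # smallest top
        min(b[1] for b in boxes),  # smallest left
        max(b[2] for b in boxes),  # largest bottom
        max(b[3] for b in boxes),  # largest right
    )


def contains(area: Box, box: Box) -> bool:
    """True if `box` lies entirely inside `area`."""
    return (
        area[0] <= box[0]
        and area[1] <= box[1]
        and area[2] >= box[2]
        and area[3] >= box[3]
    )


# Hypothetical word boxes inside a table (illustrative values only).
words = [
    (150.0, 95.0, 160.0, 200.0),
    (150.0, 300.0, 160.0, 530.0),
    (320.0, 95.0, 330.0, 200.0),
]
# Hypothetical bottom border line, drawn slightly below and wider than the text.
bottom_rule = (333.0, 88.0, 334.0, 542.0)

area = union_area(words)
print(area)                         # area built from words only
print(contains(area, bottom_rule))  # the border line falls outside it
```

Every word is inside the computed area, but the border line is not, which matches the situation described above.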

Dec 13 '20 04:12 HiromuHota

A short-term workaround would be to use the -mt option (probably with vision). A long-term fix would be either to fix the heuristics or to offload the table detection to tabula.
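As one conceivable way to adjust the heuristic (an assumption on my part, not something pdftotree does today), the detected area could be padded by a small margin so that the border lines fall inside it. The sketch below uses a hypothetical helper `pad_area`; the 5-point margin and the US Letter page size (792 x 612 points) are also assumptions.

```python
from typing import Tuple

# A box is (top, left, bottom, right), matching the order in the debug log.
Box = Tuple[float, float, float, float]


def pad_area(area: Box, margin: float, page_height: float, page_width: float) -> Box:
    """Grow `area` by `margin` on every side, clamped to the page bounds."""
    top, left, bottom, right = area
    return (
        max(0.0, top - margin),
        max(0.0, left - margin),
        min(page_height, bottom + margin),
        min(page_width, right + margin),
    )


# Area reported in the debug log above.
detected = (146.20799999999997, 90.0, 331.78175999999996, 539.4936)

# Assumed values: 5-point margin, US Letter page (792 x 612 points).
padded = pad_area(detected, margin=5.0, page_height=792.0, page_width=612.0)
print(padded)
```

A padded tuple like this could then be tried as the `area` argument of tabula-py's `read_pdf` to check whether the last row comes back; a suitable margin would likely depend on the document.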

Dec 13 '20 04:12 HiromuHota
