pdftotree
Loss of information oftentimes in the last line of a table
Describe the bug
I've tried the plain pdftotree command-line utility on a few PDF files with tables, and found that wherever there is a table structure, the last line is usually not captured in the output hOCR file.
May I ask whether that is expected behavior, or whether it has something to do with the extract_tables utility?
To Reproduce
Steps to reproduce the behavior:
- sample pdf downloaded from https://www.w3.org/WAI/WCAG20/Techniques/working-examples/PDF20/table.pdf
- run pdftotree pdf/table.pdf -o hocr/table.hocr
- check hOCR output
Expected behavior
The last line of the table should be extracted, but it is missing from the output.
Environment (please complete the following information):
- OS: macOS 10.15.6
- pdftotree Version: 0.5.0
- pdfminer.six Version: 20200726
Additional context
The same behavior occurred on a few other files I used.
This is not expected behaviour. In addition to the missing last row of the table, I can see some duplicated cells. However, this may not be pdftotree's bug, as it relies on tabula for table recognition. I'd appreciate it if you could try tabula-py directly on the same PDF.
Yea, that's also what I thought.
Will do! Thanks
Sorry for the delay. It turns out tabula works fine on the PDFs I used. Although it sometimes fails to convert a table structure accurately into a dataframe or JSON, the plain text is fully preserved. So I suspect there is a minor problem in the part of the pipeline that parses the output of tabula-py.
I looked into this issue and confirmed that it is a pdftotree bug in how it specifies the table area.
$ pdftotree table.pdf -o table.hocr -vv
[INFO] pdftotree.core - Digitized PDF detected, building tree structure...
[WARNING] pdftotree.utils.pdf.pdf_parsers - No boxes to get figures from on page 1.
[INFO] pdftotree.core - Tree structure built, creating html...
[DEBUG] pdftotree.TreeExtract - Calling tabula at page: 1 and area: (146.20799999999997, 90.0, 331.78175999999996, 539.4936).
[DEBUG] pdftotree.TreeExtract - Tabula recognized 1 table(s).
[INFO] pdftotree.core - HTML created.
hOCR output to table.hocr
As can be seen in the log message, pdftotree specified a table area as (146.20799999999997, 90.0, 331.78175999999996, 539.4936) (top, left, bottom, right).
This is actually a few pixels smaller than the actual table.
I wonder where this pixel shift happens.
I think I figured out what was happening.
When you run pdftotree without the -mt option, it detects tables heuristically.
https://github.com/HazyResearch/pdftotree/blob/0686a1845c7901aa975544a9107fc10594523986/pdftotree/TreeExtract.py#L256-L259
The heuristic used here is that words are vertically aligned in a table. https://github.com/HazyResearch/pdftotree/blob/0686a1845c7901aa975544a9107fc10594523986/pdftotree/utils/pdf/pdf_parsers.py#L54-L66
So the table area detected by this heuristic, (146.20799999999997, 90.0, 331.78175999999996, 539.4936), is actually correct given how a table is detected: it covers all the words in the table. However, it does not include the table border lines.
A short-term workaround would be to use the -mt option (probably with vision).
A long-term fix would be either to fix the heuristics or offload the table detection to tabula.
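To illustrate the point about the word-only area, here is a minimal Python sketch of one way the heuristic could be adjusted. It is not pdftotree's actual code: union_of_boxes and pad_area are hypothetical helpers, the (top, left, bottom, right) ordering follows the log message above, and the margin and page size are assumed values that would need tuning per document.

```python
from typing import Iterable, Tuple

# (top, left, bottom, right), same ordering as the area in the log message.
Area = Tuple[float, float, float, float]


def union_of_boxes(boxes: Iterable[Area]) -> Area:
    """Smallest area that covers every word box (hypothetical helper).

    This mirrors the situation described above: the result encloses the
    text, but not any ruling lines drawn around it.
    """
    boxes = list(boxes)
    if not boxes:
        raise ValueError("need at least one box")
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def pad_area(area: Area, margin: float, page_width: float, page_height: float) -> Area:
    """Grow the area by `margin` on every side, clamped to the page (hypothetical helper)."""
    top, left, bottom, right = area
    return (
        max(0.0, top - margin),
        max(0.0, left - margin),
        min(page_height, bottom + margin),
        min(page_width, right + margin),
    )
```

For example, pad_area((146.208, 90.0, 331.78176, 539.4936), 5.0, 612.0, 792.0) would widen the logged area by 5 units on each side, assuming a US Letter page. Whether a fixed margin is enough to pull in the border lines, and therefore the last row, is untested here.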