Hiromu Hota

Results: 94 comments by Hiromu Hota

I looked into this issue and confirmed that it is a bug in how pdftotree specifies a table area. ``` $ pdftotree table.pdf -o table.hocr -vv [INFO]...

I wonder where this pixel shift happens.

I think I figured out what was happening. When you run pdftotree without the `-mt` option, it detects tables heuristically. https://github.com/HazyResearch/pdftotree/blob/0686a1845c7901aa975544a9107fc10594523986/pdftotree/TreeExtract.py#L256-L259 The heuristic used here is that words are...

A short-term workaround would be to use the `-mt` option (probably with `vision`). A long-term fix would be either to fix the heuristic or to offload table detection to tabula.
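
As a rough sketch of the short-term workaround, the snippet below builds the same command line as in the earlier comment with `-mt vision` added. The helper names `build_command` and `run_workaround` are hypothetical, and it assumes the `pdftotree` executable is on `PATH`; whether `vision` actually avoids the shifted table area has not been verified here.

```python
import shutil
import subprocess


def build_command(pdf_path, out_path, model_type="vision"):
    """Build a pdftotree command line that sets -mt explicitly.

    The flags (-mt, -o, -vv) are the ones mentioned in the comments above;
    "vision" is only the value suggested there, not a confirmed fix.
    """
    return ["pdftotree", pdf_path, "-mt", model_type, "-o", out_path, "-vv"]


def run_workaround(pdf_path, out_path):
    """Run the command if pdftotree is installed; otherwise return None."""
    if shutil.which("pdftotree") is None:
        return None
    # check=False: leave it to the caller to inspect the return code.
    return subprocess.run(build_command(pdf_path, out_path), check=False)
```

Calling `run_workaround("table.pdf", "table.hocr")` would then invoke pdftotree with an explicit table-detection model instead of the default heuristic.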