pdfalto
option cutPages not working?
The option -cutPages does not seem to be working:
Without that option, all text lines are extracted:
pdfalto -blocks input.pdf
But with the option enabled:
pdfalto -blocks -cutPages input.pdf
only the metadata is extracted into input.xml, without any text lines.
I've looked at the source, and noticed something strange here:
https://github.com/kermitt2/pdfalto/blob/b19381516e6fa6ce077f9b0a156dd7cfa07c08da/src/pdfalto.cc#L228
That means: if the option is set on the CLI, then it is set to false?
Hello @nicolasfranck, thanks for reporting the issue! It is indeed not working, but the cause is not how the option is set in src/pdfalto.cc. A piece is missing in TextPage::endPage (it was originally in pdf2xml, see https://github.com/kermitt2/pdf2xml/blob/master/src/XmlOutputDev.cc#L920).
Although it should not be complicated to put back, it adds some overall complexity (it has an impact on the extracted images). I am wondering whether this option is actually very useful. What do you think?
Normally an OCR tool delivers ALTO XML for each image (so only one page) instead of combining everything into one file. That makes it easy to associate the OCR output with the image, so this option could mimic that behaviour.
Of course, one can also do this manually.
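For anyone who wants the manual route in the meantime, here is a minimal sketch of what I have in mind. It assumes pdfalto's -f and -l options (first and last page to convert) can be set to the same page number to get a single-page ALTO file, and that the page count is already known from elsewhere. The helper names and the output file naming are only illustrative, not part of pdfalto.

```python
import subprocess
from pathlib import Path


def per_page_commands(pdf_path, page_count, extra_args=("-blocks",)):
    """Build one pdfalto command line per page (pages are 1-based).

    Assumption: "-f N -l N" restricts the conversion to page N only.
    The "<stem>_<page>.xml" naming is a hypothetical convention.
    """
    pdf = Path(pdf_path)
    commands = []
    for page in range(1, page_count + 1):
        out_xml = pdf.with_name(f"{pdf.stem}_{page:04d}.xml")
        commands.append(
            ["pdfalto", *extra_args,
             "-f", str(page), "-l", str(page),
             str(pdf), str(out_xml)]
        )
    return commands


def run_all(pdf_path, page_count):
    """Run pdfalto once per page; raises if any invocation fails."""
    for cmd in per_page_commands(pdf_path, page_count):
        subprocess.run(cmd, check=True)
```

Calling run_all("input.pdf", 12) would then invoke pdfalto twelve times and leave input_0001.xml to input_0012.xml next to the PDF. This is slower than a working -cutPages, since the PDF is reopened for every page, and I have not checked how extracted images are named when pages are converted one at a time.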