pdfalto option cutPages not working?

Open nicolasfranck opened this issue 6 years ago • 2 comments

The -cutPages option does not seem to be working.

Without that option, all text lines are extracted:

pdfalto -blocks input.pdf

But with the option enabled:

pdfalto -blocks -cutPages input.pdf

only the metadata is written to input.xml, with no text lines.

I've looked at the source, and noticed something strange here:

https://github.com/kermitt2/pdfalto/blob/b19381516e6fa6ce077f9b0a156dd7cfa07c08da/src/pdfalto.cc#L228

Does that mean that if the option is set on the CLI, it is then set to false?

May 24 '19 11:05 nicolasfranck

Hello @nicolasfranck, thanks for reporting the issue! It is indeed not working, but the cause is not how the option is set in src/pdfalto.cc. A piece is missing in TextPage::endPage (it was originally in pdf2xml, see https://github.com/kermitt2/pdf2xml/blob/master/src/XmlOutputDev.cc#L920).

Although it should not be complicated to put it back, it adds some overall complexity (it has an impact on the extracted images). I am wondering whether this option is actually very useful. What do you think?

May 31 '19 05:05 kermitt2

Normally an OCR tool delivers ALTO XML for each image (so only one page) instead of combining everything into one file. That makes it easy to associate the OCR output with the image, so this option could mimic that behaviour.

Of course, one can also do this manually.
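For illustration, here is a minimal sketch of what "manually" could look like: post-processing the single combined file that pdfalto writes without -cutPages. It assumes the combined output follows the usual ALTO layout, with Page elements as direct children of a Layout element. The function names split_alto_pages and local_name are made up for this example and are not part of pdfalto. Tags are matched by local name so the sketch does not depend on a particular ALTO namespace version. It does not touch extracted images, which the option itself would also have to handle.

```python
import copy
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def split_alto_pages(root):
    """Return one deep-copied ALTO tree per <Page> found under <Layout>.

    Assumption: <Page> elements are direct children of the first <Layout>.
    Everything outside the pages (description, styles, ...) is copied as-is
    into every per-page tree.
    """
    layout = next((el for el in root.iter() if local_name(el.tag) == "Layout"), None)
    if layout is None:
        return []

    page_indexes = [
        i for i, child in enumerate(layout) if local_name(child.tag) == "Page"
    ]

    results = []
    for keep in page_indexes:
        clone = copy.deepcopy(root)
        clone_layout = next(
            el for el in clone.iter() if local_name(el.tag) == "Layout"
        )
        # Delete from the end so the remaining indexes stay valid.
        for i in reversed(page_indexes):
            if i != keep:
                del clone_layout[i]
        results.append(clone)
    return results


def write_pages(input_path, prefix):
    """Write each page of a combined ALTO file to '<prefix>_<n>.xml'."""
    root = ET.parse(input_path).getroot()
    paths = []
    for n, page_root in enumerate(split_alto_pages(root), start=1):
        path = f"{prefix}_{n}.xml"
        ET.ElementTree(page_root).write(path, encoding="utf-8", xml_declaration=True)
        paths.append(path)
    return paths
```

Calling write_pages("input.xml", "input_page") would then produce input_page_1.xml, input_page_2.xml, and so on. Note that ElementTree may rewrite namespace prefixes on output (for example as ns0) unless the namespace is registered first; the resulting XML is equivalent.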

May 31 '19 12:05 nicolasfranck
