
Need DPI option.

Open eighttails opened this issue 4 years ago • 2 comments

I want to generate annotated image files to train OCR.

wget https://ia800902.us.archive.org/14/items/arxiv-0704.0646/0704.0646.pdf
pdfalto 0704.0646.pdf 0704.0646.xml

The generated ALTO file shows page WIDTH and HEIGHT as 612 and 792, so it appears to assume the DPI is always 72.

The PDF is vector based and can be rendered at any DPI. I generated 300 dpi images from the PDF and I want the ALTO file to match at 300 dpi. Please consider adding a --dpi option to set the DPI manually.

eighttails, Apr 26 '20 03:04

Hello @eighttails !

Thank you for the feature request.

Yes, for now we keep the PDF point values in the ALTO file, which are "independent" of any resolution. My thinking was that, as with a PDF, the values could be scaled to any resolution, assuming that the tool using the ALTO file would scale them according to its needs.

But an ALTO file does normally have a "physical" measurement unit, so adding a --dpi option makes a lot of sense. We will try to add it in a future version.

kermitt2, Jun 04 '20 17:06

@kermitt2 I'd like to add a +1 for the -dpi option. In our deployment (see https://github.com/esmero/strawberryfield) we are using IAB with Solr highlighting (great module https://github.com/dbmdz/solr-ocrhighlighting), and it would be really useful to have ALTO with dimensions scaled to pixels (i.e. points / 72 * dpi) to avoid a lot of calculation overhead when rendering. Anyway, thanks a lot again for this great pdfalto command!!

giancarlobi, May 12 '21 16:05
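
Until such an option exists, the scaling mentioned in the thread (points / 72 * dpi) can be applied to the ALTO output as a post-processing step. The following is only a minimal sketch of that workaround, not a pdfalto feature: the helper names `points_to_pixels` and `rescale_alto` are hypothetical, and it assumes the coordinates live in the HPOS, VPOS, WIDTH and HEIGHT attributes and are expressed in PDF points, as described above.

```python
import xml.etree.ElementTree as ET

POINTS_PER_INCH = 72.0
# Assumption: these attributes hold every coordinate that needs scaling.
GEOMETRY_ATTRS = ("HPOS", "VPOS", "WIDTH", "HEIGHT")


def points_to_pixels(value_pt, dpi):
    """Convert a length in PDF points to pixels at the given DPI."""
    return value_pt / POINTS_PER_INCH * dpi


def rescale_alto(xml_text, dpi):
    """Return the ALTO XML with all geometry attributes scaled to pixels.

    Hypothetical helper: it walks every element and rewrites the
    attributes listed in GEOMETRY_ATTRS, leaving everything else as-is.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        for name in GEOMETRY_ATTRS:
            if name in elem.attrib:
                scaled = points_to_pixels(float(elem.attrib[name]), dpi)
                elem.set(name, f"{scaled:.2f}")
    return ET.tostring(root, encoding="unicode")


# A US Letter page (612 x 792 points) rendered at 300 dpi:
print(points_to_pixels(612, 300))  # 2550.0
print(points_to_pixels(792, 300))  # 3300.0
```

To process a real file, read it into a string, pass it to `rescale_alto` with the DPI used for the images, and write the result back out. If the file declares a measurement unit, it may need to be updated separately so that it matches the scaled values; the sketch does not touch it.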