pdfalto

Is there an option to output ALTO XML to STDOUT?

Open Sukii opened this issue 2 years ago • 3 comments

I need it for a downstream XSLT pipeline: https://gitlab.coko.foundation/XSweet/XSweet/-/tree/pdf2html/applications/pdf2html

Sukii avatar May 22 '22 11:05 Sukii

Hello @Sukii !

There is currently no such option. The normal use case produces several files in addition to the ALTO document, to cover information in the PDF that cannot be encoded in ALTO (annotations, outline, ...), so I have not planned to add it so far. I guess working with files is not a problem in itself; would the point of using pipes with stdout/stdin be to speed up the XSLT transformation a bit?

kermitt2 avatar Jun 03 '22 00:06 kermitt2

Yes. Beyond the speed improvement, Linux pipes let us send the output directly to the web services, which avoids possible collisions, race conditions, etc. Of course, images and similar binary assets are better kept as separate files, so writing those to disk may be necessary anyway.

Sukii avatar Jun 03 '22 05:06 Sukii

https://gitlab.coko.foundation/XSweet/XSweet/-/tree/pdf2html/applications/pdf2html

Sukii avatar Jun 03 '22 14:06 Sukii
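
Since the thread confirms there is no STDOUT option, the file-based workflow can be wrapped so that it behaves like a pipe. The following is only a sketch, not something from the pdfalto project: the helper name `alto_to_stream` and its `command` parameter are hypothetical, and the invocation form `pdfalto <PDF-file> <xml-file>` is an assumption that should be checked against the tool's own usage text. Each call writes into its own private temporary directory, which is what avoids the collisions and race conditions mentioned above.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def alto_to_stream(pdf_path, out_stream, command=("pdfalto",)):
    """Convert a PDF via a file-based tool and copy the resulting XML to out_stream.

    `command` is a hypothetical parameter: the executable plus any leading
    options, assumed to accept `<PDF-file> <xml-file>` as trailing arguments.
    """
    # A fresh directory per call, so concurrent runs never share an output path.
    with tempfile.TemporaryDirectory(prefix="pdfalto-") as tmp:
        xml_path = Path(tmp) / "out.xml"
        # check=True raises CalledProcessError if the converter exits non-zero.
        subprocess.run([*command, str(pdf_path), str(xml_path)], check=True)
        out_stream.write(xml_path.read_text(encoding="utf-8"))
    # Anything else the tool wrote into the temporary directory is deleted here;
    # copy such files elsewhere inside the `with` block if they are needed.


if __name__ == "__main__" and len(sys.argv) == 2:
    alto_to_stream(sys.argv[1], sys.stdout)
```

Saved under a hypothetical name such as `alto_stdout.py`, this could feed an XSLT processor that reads standard input, for example `python alto_stdout.py input.pdf | xsltproc transform.xsl -`. It does not remove the disk write, so it addresses the collision concern rather than the speed concern.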