pdfalto
PDF to XML ALTO file converter
We currently use simple formatting patterns like `%1.4f` to serialize the coordinates in the XML and SVG files (avoiding `e` formatting, which can introduce an exponent). The drawback is that...
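As a rough illustration of the fixed-point serialization described above, here is a minimal Python sketch. The helper name `format_coord` and the sample values are hypothetical and are not taken from the pdfalto code base; the sketch only shows that a `%1.4f`-style pattern never emits an exponent, whereas a default float representation can.

```python
def format_coord(value: float, decimals: int = 4) -> str:
    """Serialize a coordinate in fixed-point notation (hypothetical helper).

    Mirrors a printf-style "%1.4f" pattern: a fixed number of decimals
    and no exponential notation in the output.
    """
    return f"{value:.{decimals}f}"


if __name__ == "__main__":
    small = 0.00001234
    # The default representation of a small float uses an exponent.
    print(repr(small))            # 1.234e-05
    # The fixed-point pattern keeps four decimals and no exponent.
    print(format_coord(small))    # 0.0000
    print(format_coord(12.5))     # 12.5000
```

The number of decimals is left as a parameter here only to make the trade-off between output size and precision easy to experiment with.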
I believe there aren't currently any unit tests (apologies if there are)... It would be good to have some unit tests, e.g. for the functionality that decides when to start...
For some reason, the rotation attribute, which was present in pdf2xml and is still computed, is currently not output in the ALTO file. If I remember well, we though...
It looks like icu-project.org redesigned/reorganised their website(s) and the URL used in install_deps.sh (ICU_URI) no longer works. Also, the new download page (http://site.icu-project.org/download) doesn't seem to be listing ICU4C...
When parsing the attached PDF file, the letters are parsed as separate entities. Note that the file has watermark-type text plastered across it, using Apache PDFBox to remove...
There are some PDF files that xpdf 4.00 can't open but the latest version of xpdf can open without error. I updated the local `xpdf` fork to the latest version...
Hi @kermitt2 I have now merged with upstream master and during evaluation I found some error cases where the line numbers are not filtered out. I can confirm that the...
The version of icu that is downloaded in the script (icu4c-62_2-src.tgz) does not match the name of the file the script is trying to unzip (icu4c-62_1-src.tgz). This causes the script...
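One way to avoid this kind of mismatch is to derive both the download location and the archive name from a single version value. The following is only a sketch in Python, not the project's install script: the function names and the base URL `https://example.org/icu` are hypothetical placeholders, and only the `icu4c-<version>-src.tgz` naming pattern comes from the report above.

```python
def icu_archive_name(version: str) -> str:
    """Build the archive file name from the version (pattern from the report)."""
    return f"icu4c-{version}-src.tgz"


def icu_download_url(base_url: str, version: str) -> str:
    """Build the download URL; base_url is a hypothetical placeholder."""
    return f"{base_url.rstrip('/')}/{icu_archive_name(version)}"


if __name__ == "__main__":
    version = "62_2"  # single source of truth for the version
    url = icu_download_url("https://example.org/icu", version)
    archive = icu_archive_name(version)
    # The file to unpack is, by construction, the file that was downloaded.
    print(url)
    print(archive)
```

Because the archive name is computed in one place, bumping the version cannot leave the download step and the unpack step pointing at different files.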
I was trying to track down why running GROBID locally produced different results compared to running it via Docker. In the end it seems that the output of...