pdfalto

PDF to XML ALTO file converter

Results 85 pdfalto issues

We currently use simple formatting patterns like `%1.4f` to serialize the coordinates in the XML and SVG files (avoiding `e` formatting, which can introduce an exponent). The drawback is that...

enhancement
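
The difference between the `%1.4f` pattern and `e` formatting mentioned in the issue above can be sketched as follows. This is only an illustration, not pdfalto's own code: the helper name `format_coordinate` and the sample values are hypothetical, and the sketch relies on Python's `%` operator following the same printf conventions for `%f` and `%e`.

```python
def format_coordinate(value: float) -> str:
    """Serialize a coordinate in fixed-point notation with four decimals.

    Hypothetical helper: it mirrors the `%1.4f` pattern quoted in the issue.
    """
    return "%1.4f" % value


# Sample values chosen for illustration only.
large = 1234567.0
typical = 12.3456789

print(format_coordinate(typical))  # 12.3457
print(format_coordinate(large))    # 1234567.0000 (fixed-point, no exponent)
print("%e" % large)                # 1.234567e+06 (exponent notation)
```

With `%1.4f` the output always has exactly four digits after the decimal point and never contains an exponent, whereas `%e` always writes one.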

I believe there aren't currently any unit tests (apologies if there are)... It would be good to have some unit tests, e.g. for the functionality that decides when to start...

enhancement

For some reason, the rotation attribute, which was present in pdf2xml and is still computed, is currently not output in the ALTO file. If I remember well, we though...

enhancement

It looks like icu-project.org redesigned/reorganised their website(s) and the URL used in install_deps.sh (ICU_URI) is not working anymore. Also, the new download page (http://site.icu-project.org/download) doesn't seem to be listing ICU4C...

implemented

When parsing the attached PDF file, the letters are parsed as separate entities. Note that the file has watermark-type text plastered across, using Apache PDFBox to remove...

implemented

There are some PDF files that xpdf 4.00 can't open but the latest version of xpdf can open without error. I updated the local `xpdf` fork to the latest version...

Hi @kermitt2 I have now merged with upstream master and during evaluation I found some error cases where the line numbers are not filtered out. I can confirm that the...

implemented

The version of icu that is downloaded in the script (icu4c-62_2-src.tgz) does not match the name of the file the script is trying to unzip (icu4c-62_1-src.tgz). This causes the script...

implemented

I was trying to track down why running GROBID locally produced different results compared to running it via Docker. In the end it seems that the output of...

implemented

kermitt2/pdf2xml/issues/5

enhancement