One PDF where random strings are dropped (depending on filename length?)

Open: kcstrong opened this issue 3 years ago • 1 comment

We're currently testing pdfalto. Specifically, we're converting a lot of PDFs to HTML via the XML output of pdfalto (as we were not quite satisfied with the result of any of the pdftohtml tools we tested). Most of the results are excellent, though we are finding some issues. We're using pdfalto v0.5 compiled on Windows via Cygwin as per the instructions.

In this case, different strings are being dropped from a PDF (only this one so far), always beginning on or around page 45, apparently depending on the length of the filename. I stumbled upon this by accident. For example, certain strings are dropped from foo.pdf. If I rename the file to foo-99.pdf, different strings are dropped. If I rename it to bar.pdf, the same strings are dropped as with foo.pdf. I've renamed and processed the same file at least ten times, and each result differs from every other except where the filenames were the same length.
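One way to check the filename-length observation systematically is a small script along these lines. It is only a sketch: it assumes `pdfalto` is on the PATH and is invoked as `pdfalto <PDF-file> <xml-file>`, and the helper names (`run_pdfalto`, `text_digest`, `group_by_name_length`) are made up for illustration. It compares only the `CONTENT` attributes of the ALTO `String` elements, so that any metadata in the XML that echoes the filename does not count as a difference.

```python
import hashlib
import shutil
import subprocess
import tempfile
import xml.etree.ElementTree as ET
from collections import defaultdict
from pathlib import Path


def run_pdfalto(pdf_path, xml_path):
    # Assumption: pdfalto is on PATH and takes "<PDF-file> <xml-file>".
    subprocess.run(["pdfalto", str(pdf_path), str(xml_path)], check=True)


def text_digest(xml_path):
    # Hash only the extracted words (CONTENT of every String element),
    # ignoring the XML namespace and any other metadata in the file.
    words = [
        el.get("CONTENT", "")
        for el in ET.parse(str(xml_path)).iter()
        if el.tag.rsplit("}", 1)[-1] == "String"
    ]
    return hashlib.sha256("\n".join(words).encode("utf-8")).hexdigest()


def group_by_name_length(source_pdf, stems, convert=run_pdfalto):
    """Copy source_pdf under each (unique) stem, convert every copy, and
    map filename length -> set of distinct text digests seen at that length."""
    groups = defaultdict(set)
    with tempfile.TemporaryDirectory() as tmp:
        for stem in stems:
            pdf_copy = Path(tmp) / f"{stem}.pdf"
            shutil.copyfile(source_pdf, pdf_copy)
            xml_out = Path(tmp) / f"{stem}.xml"
            convert(pdf_copy, xml_out)
            groups[len(pdf_copy.name)].add(text_digest(xml_out))
    return dict(groups)
```

If the observation holds, calling `group_by_name_length("original.pdf", ["foo", "bar", "foo-99"])` should give a single digest for length 7 (foo.pdf, bar.pdf) and a different single digest for length 10 (foo-99.pdf). Here `original.pdf` is a placeholder for the affected file.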

To whomever wants to test this: I can send you the file, but am bound by policy to protect our copyright. If there's a way I can send you the file privately that would be preferable.

Thanks

kcstrong, Jun 17 '21 14:06

Hello @kcstrong !

Thanks a lot for the tests and reporting the issues.

I would be happy to try to reproduce the problem with your file and investigate it. You can send the file to my private address, which you can find here (the first email listed).

kermitt2, Jun 18 '21 03:06