Randomly omitted characters

Open de-code opened this issue 3 years ago • 2 comments

I was trying to track down why running GROBID locally produced different results compared to running it via Docker. In the end, it seems that the output of pdfalto can change randomly between runs.

For example, take 262469v1.pdf (I attached the exact version I was using). It is from the biorxiv 10k test dataset (please do not use it for training purposes).

262469v1 is one of the documents with spacing issues, but I would still expect it to produce the same results on every run.

When I run the following command:

docker run --rm \
  -v $PWD/data:/data \
  lfoppiano/grobid:0.6.1 \
  "/opt/grobid/grobid-home/pdf2xml/lin-64/pdfalto" \
  -noImageInline -fullFontName -noLineNumbers -noImage -annotation -filesLimit 2000 \
  "/data/pdf/262469v1.pdf" \
  "/data/pdf/lxml-docker-0.6.1-direct/262469v1.lxml"

Then the result is not exactly the same from one run to the next. Some characters appear to be randomly omitted.

md5sum for pdfalto is 871e22e83833f773dae2b2f5e70df8ae (Linux x64).

262469v1.pdf.gz 262469v1_0.6.1_run_1_formatted.lxml.gz 262469v1_0.6.1_run_2_formatted.lxml.gz

(I formatted the results using xmllint)
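One way to confirm this kind of run-to-run variation is to keep the output of several runs and count how many distinct files they produced. The following is only a minimal sketch: the function name `count_distinct_outputs` and the file names in the usage comment are hypothetical, and it uses the POSIX `cksum` utility rather than anything specific to pdfalto.

```shell
#!/bin/sh
# Hypothetical helper (not part of pdfalto or GROBID).
# Takes the output files of repeated runs as arguments and prints the
# number of distinct checksums among them; 1 means all runs were identical.
count_distinct_outputs() {
  for f in "$@"; do
    # Read from stdin so the file name is not part of the checksum line.
    cksum < "$f"
  done | sort -u | wc -l | tr -d ' '
}

# Usage sketch (file names are assumptions):
#   count_distinct_outputs run_1.lxml run_2.lxml run_3.lxml
```

If the printed count is greater than 1, running `diff` on two of the files should show where characters were dropped.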

Nov 23 '20 18:11 de-code

I noticed the same problem, but I thought it did not occur on Linux. This report seems to prove otherwise.

Maybe related: #95

Dec 10 '20 23:12 lfoppiano

Thank you for the error case and #108 - the error was introduced with the processing of line numbers... this is a priority on my next iteration on pdfalto.

Dec 11 '20 00:12 kermitt2