pdfalto
Randomly omitted characters
I was trying to track down why running GROBID locally produced different results compared to running it via Docker. In the end, it seems that the output of pdfalto can change randomly from one run to the next.
For example, take 262469v1.pdf (I attached the exact version I was using). 262469v1 is from the bioRxiv 10k test dataset (please do not use it for training purposes). 262469v1 is one of the documents with spacing issues, but I would still expect it to produce the same results on every run.
When I run the following command:
docker run --rm \
-v $PWD/data:/data \
lfoppiano/grobid:0.6.1 \
"/opt/grobid/grobid-home/pdf2xml/lin-64/pdfalto" \
-noImageInline -fullFontName -noLineNumbers -noImage -annotation -filesLimit 2000 \
"/data/pdf/262469v1.pdf" \
"/data/pdf/lxml-docker-0.6.1-direct/262469v1.lxml"
Then the result is not exactly the same from run to run. Some characters appear to be randomly omitted.
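To check the run-to-run difference, one option is to run the same command twice into two output files and compare them byte for byte. The following is only a minimal sketch under assumptions: the helper names `run_pdfalto_once` and `same_output` and the output names `run_1` and `run_2` are made up for illustration, and the output is written directly under `/data/pdf` rather than into the subdirectory used above.

```shell
#!/bin/sh
# Sketch only: helper names and output file names are hypothetical.

# run_pdfalto_once NAME
# Runs the pdfalto command from this report once and writes the result to
# /data/pdf/NAME.lxml inside the container (data/pdf/NAME.lxml on the host).
run_pdfalto_once() {
  docker run --rm \
    -v "$PWD/data:/data" \
    lfoppiano/grobid:0.6.1 \
    "/opt/grobid/grobid-home/pdf2xml/lin-64/pdfalto" \
    -noImageInline -fullFontName -noLineNumbers -noImage -annotation -filesLimit 2000 \
    "/data/pdf/262469v1.pdf" \
    "/data/pdf/$1.lxml"
}

# same_output FILE_A FILE_B
# Succeeds (exit status 0) only if both files are byte-identical.
same_output() {
  cmp -s "$1" "$2"
}

# Example usage (needs Docker and the PDF under ./data/pdf):
#   run_pdfalto_once run_1
#   run_pdfalto_once run_2
#   if same_output data/pdf/run_1.lxml data/pdf/run_2.lxml; then
#     echo "identical"
#   else
#     echo "outputs differ"
#   fi
```

A byte comparison only says whether the two runs differ, not where. To locate the omitted characters, the two files would still need to be diffed after formatting.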
The md5sum for pdfalto is 871e22e83833f773dae2b2f5e70df8ae (Linux x64).
262469v1.pdf.gz 262469v1_0.6.1_run_1_formatted.lxml.gz 262469v1_0.6.1_run_2_formatted.lxml.gz
(I formatted the results using xmllint.)
I noticed the same problem, and I thought that this problem did not occur on Linux. Now, this seems to prove the opposite.
Maybe related: #95
Thank you for the error case and #108 - the error was introduced with the processing of line numbers... this is a priority on my next iteration on pdfalto.