grobid
Same document, different PDF files, same curl command, consistently different output.
I have two PDF versions of a paper that look identical on visual inspection. The only differences I can detect are the file size (2.2 MB vs 900 KB) and that my PDF viewer shows a contents bar for the larger file but not for the smaller one. I am no PDF expert.
I process both files with the command below.
curl -v --form input=@./paper.pdf --form teiCoordinates=ref --form teiCoordinates=biblStruct --form teiCoordinates=figure --form teiCoordinates=persName --form teiCoordinates=formula --form segmentSentences=1 --form teiCoordinates=s https://kermitt2-grobid.hf.space/api/processFulltextDocument > ./paper.xml
The XML outputs differ. Specifically, GROBID correctly outputs <graphic coords=... type='bitmap'> for all figures in the small file, but for the large file it outputs the graphic coords for only one figure, even though it still detects the figures correctly. I am attaching the files for reproducibility.
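To make the difference easier to pin down, here is a minimal sketch of how the two outputs could be compared figure by figure. It uses only Python's standard library. The function name summarize_figures is my own, and the sketch assumes that each figure appears as a figure element which may contain graphic children, matching elements by local name so that it does not depend on the exact namespace.

```python
import xml.etree.ElementTree as ET

# Namespace-qualified name of the xml:id attribute as ElementTree reports it.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def local_name(tag):
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def summarize_figures(tei_xml):
    """Return one dict per <figure>, noting which coords are present.

    Assumption: figures are <figure> elements that may contain
    <graphic> children; this is a diagnostic sketch, not GROBID's API.
    """
    root = ET.fromstring(tei_xml)
    summary = []
    for elem in root.iter():
        if local_name(elem.tag) != "figure":
            continue
        graphics = [g for g in elem.iter() if local_name(g.tag) == "graphic"]
        summary.append({
            "id": elem.get(XML_ID),
            "figure_has_coords": elem.get("coords") is not None,
            "graphics": len(graphics),
            "graphics_with_coords": sum(1 for g in graphics if g.get("coords")),
        })
    return summary
```

Calling summarize_figures on the contents of each XML output (read as text) and printing the two lists side by side should show which figures lose their graphic element, or only its coords attribute, in the output for the large file.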
I would appreciate it if someone could help me understand why this happens, or at least help me get started with an investigation.