
Same document, different PDF files, same curl command, predictably different output.

Open haykharut opened this issue 7 months ago • 2 comments

I have 2 PDF versions of a paper, which look exactly the same when inspected visually. The only differences I can detect are the file size (2.2MB vs 900KB) and the fact that my PDF viewer shows a contents bar for the big file but not for the small file. I am no PDF expert.

I process both files with the command below.

curl -v --form input=@./paper.pdf --form teiCoordinates=ref --form teiCoordinates=biblStruct --form teiCoordinates=figure --form teiCoordinates=persName --form teiCoordinates=formula --form segmentSentences=1 --form teiCoordinates=s https://kermitt2-grobid.hf.space/api/processFulltextDocument > ./paper.xml

The XML outputs differ. Specifically, for the small file GROBID correctly outputs <graphic coords=... type='bitmap'> for every figure, while for the large file it outputs the graphic coords for only 1 figure, even though it still detects all the figures correctly. I am attaching the files for reproducibility.
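For anyone who wants to quantify the difference between the two outputs, here is a minimal sketch that counts figures and graphics in a TEI file. It assumes the output uses the standard TEI namespace (http://www.tei-c.org/ns/1.0) and that each <graphic> is a direct child of a <figure>. The names summarize_figures and summarize_file, and the output file names in the comment, are made up for illustration and are not part of GROBID.

```python
import xml.etree.ElementTree as ET

# Assumption: GROBID's TEI output lives in the standard TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def summarize_figures(root):
    """Count <figure> elements and the <graphic> children that carry coords."""
    figures = root.findall(".//tei:figure", TEI_NS)
    # Assumption: graphics are direct children of their <figure>.
    graphics = root.findall(".//tei:figure/tei:graphic", TEI_NS)
    with_coords = [g for g in graphics if g.get("coords")]
    return {
        "figures": len(figures),
        "graphics": len(graphics),
        "graphics_with_coords": len(with_coords),
    }


def summarize_file(path):
    """Parse a TEI file from disk and summarize its figures."""
    return summarize_figures(ET.parse(path).getroot())


# Hypothetical usage, with the two outputs saved under these names:
#   print(summarize_file("paper_big.xml"))
#   print(summarize_file("paper_small.xml"))
```

If the figure counts match but the graphics counts differ, that would point at how the images are extracted from the large PDF rather than at figure detection itself.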

I would appreciate it if someone could help me understand why this happens, or at least help me get started with an investigation.

paper_big.pdf paper_small.pdf

haykharut, Jun 29 '24 13:06