
Error 408 when consolidating citations

Open LukasWallrich opened this issue 1 year ago • 7 comments

The following file can be processed by grobid when called with curl, but the equivalent (?) Python command fails with error 408.

meta-chinese.pdf

curl call:

curl -v --form input=@./meta-chinese.pdf --form consolidateCitations=1 localhost:8070/api/processReferences

python:

client.process("processReferences", "./screened_PDF", consolidate_citations=True)

The server log does not show any obvious issues. The python command works when I don't consolidate citations.

Any ideas / suggestions?

LukasWallrich avatar Oct 04 '22 12:10 LukasWallrich

Just to add - a basic requests.post call works from Python. I can't quite see what the client is doing differently ...

import requests

GROBID_URL = "http://localhost:8070"
url = f"{GROBID_URL}/api/processReferences"
pdf = "./screened_PDF/meta-chinese.pdf"
with open(pdf, "rb") as fh:
    xml = requests.post(url, files={"input": fh}, data={"consolidateCitations": "1"})

LukasWallrich avatar Oct 04 '22 13:10 LukasWallrich

@LukasWallrich, the input_path should be a directory. Admittedly this is a bug, in the sense that the client should report it when it isn't. Single files can be processed by calling process_pdf, though I'm not sure process_pdf is meant to be called directly like that.

lfoppiano avatar Oct 04 '22 22:10 lfoppiano

Hello !

The purpose of this client is to batch-process a directory of files while managing concurrency efficiently. I tried to make this explicit in the readme and in the --help:

--input INPUT         path to the directory containing PDF files or .txt
                        (for processCitationList only, one reference per line)
                        to process
  --output OUTPUT       path to the directory where to put the results
                        (optional)

If you want to process a single PDF file, you can use client.process_pdf(), but as Luca said, it's not designed to be used like that outside a batch process, so all the arguments must be provided.
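A hedged sketch of such a single-file call, assuming a local config.json and the argument names suggested by this thread (the exact process_pdf signature may differ between client versions, so check the client source before relying on it):

```python
# Single-file call via the client -- a sketch, not the documented API.
# The keyword names below are assumptions; the thread only says that
# every option must be passed explicitly.
SERVICE = "processReferences"
PDF = "./screened_PDF/meta-chinese.pdf"

if __name__ == "__main__":
    from grobid_client.grobid_client import GrobidClient

    client = GrobidClient(config_path="./config.json")
    pdf_file, status, text = client.process_pdf(
        SERVICE,
        PDF,
        generateIDs=False,
        consolidate_header=False,
        consolidate_citations=True,
        include_raw_citations=False,
        include_raw_affiliations=False,
        tei_coordinates=False,
        segment_sentences=False,
    )
    print(status)  # 200 on success; text holds the TEI XML
```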

kermitt2 avatar Oct 05 '22 06:10 kermitt2

Thank you both! The input here is a folder with two files - the other one works fine. So that does not seem to be the issue.

LukasWallrich avatar Oct 05 '22 06:10 LukasWallrich

If it's a 408 timeout, it might simply be that the Crossref API is too slow to consolidate the citations. But for 2 files, that would mean the Crossref API is very, very slow. You can improve the response time a bit by indicating your email in the GROBID config file (the "polite" usage): https://grobid.readthedocs.io/en/latest/Consolidation/#crossref-rest-api

However, when it is not in good shape, the Crossref API sometimes takes several seconds to answer each request. With many references, the timeout (60 seconds) might be reached. Even with a Plus token, this can happen.
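A back-of-envelope check of why a long reference list hits that limit, assuming roughly one second per Crossref lookup when the API is slow:

```python
# Why consolidation can exceed a 60 s request timeout.
# Assumption: ~1 s per Crossref lookup when the API is in bad shape.
per_lookup_s = 1.0
n_references = 80            # e.g. a paper with a long reference list
total_s = n_references * per_lookup_s
server_timeout_s = 60        # the timeout mentioned in this thread
will_time_out = total_s > server_timeout_s  # 80 s > 60 s
```

So a single PDF with 80 references is already enough to trigger a 408, even if each individual lookup succeeds.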

For production, it's not really possible to use the Crossref web API, which is why biblio-glutton was developed.

kermitt2 avatar Oct 05 '22 06:10 kermitt2

Thanks. Adding the email is a bit difficult, as I am on an M2 Mac and can therefore only run GROBID in the Docker container, which is hard to edit. Anyway, the request through the client fails even when there is only one PDF in the folder, while the manual Python request works. Also, the server log shows that Crossref requests go through every second or so, so there might be something more specific going on.

For my use case, I only need to process a couple of hundred PDFs, so I can take the more manual route, but the client would obviously be helpful.
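For reference, the "manual route" can be sketched as a plain loop over requests.post, mirroring the call that worked above. The helper names and output naming scheme are mine, and the default local server URL is assumed:

```python
import pathlib

import requests

GROBID_URL = "http://localhost:8070"  # assumption: default local server


def collect_pdfs(folder):
    """Return the PDF paths in a folder, sorted, non-recursive."""
    return sorted(pathlib.Path(folder).glob("*.pdf"))


def process_references(pdf_path, consolidate=True, timeout=120):
    """POST one PDF to /api/processReferences and return the TEI XML.

    A generous client-side timeout leaves room for slow Crossref lookups.
    """
    with open(pdf_path, "rb") as fh:
        resp = requests.post(
            f"{GROBID_URL}/api/processReferences",
            files={"input": fh},
            data={"consolidateCitations": "1" if consolidate else "0"},
            timeout=timeout,
        )
    resp.raise_for_status()
    return resp.text


if __name__ == "__main__":
    for pdf in collect_pdfs("./screened_PDF"):
        tei = process_references(pdf)
        # Write the result next to the source PDF, e.g. meta-chinese.tei.xml
        pdf.with_suffix(".tei.xml").write_text(tei, encoding="utf-8")
```

This loses the client's concurrency handling, but for a few hundred PDFs a sequential loop is usually acceptable.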

LukasWallrich avatar Oct 05 '22 09:10 LukasWallrich

Adding the email is a bit difficult, as I am on an M2 Mac and can therefore only run GROBID in the Docker container, which is hard to edit.

You don't need to edit the container; simply edit the config file locally and mount it when launching the container, like this:

docker run --rm --gpus all -p 8080:8070 -p 8081:8071 -v /home/lopez/grobid/grobid-home/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro  grobid/grobid:0.7.2-SNAPSHOT

(where /home/lopez/grobid/grobid-home/config/grobid.yaml is your edited local config file with your email for Crossref politeness)
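The relevant edit is small. The exact structure of grobid.yaml can differ between GROBID versions, so treat this fragment as a sketch and verify it against the file shipped with your version and the consolidation docs linked above:

```yaml
# grobid.yaml (fragment) -- structure is an assumption, check your version
grobid:
  consolidation:
    service: "crossref"
    crossref:
      mailto: "you@example.org"   # your email for "polite" Crossref usage
```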

the server log shows that Crossref requests go through every second or so, so there might be something more specific going on.

This is probably too slow... A good rate is at least 10 consolidated citations per second; below that you get painful slowness and timeouts when parallelizing processing. If it's just a few hundred PDFs, you can try the public biblio-glutton instance (which synchronizes itself daily with Crossref) with a low concurrency, to avoid putting too heavy a load on this cheap server :D

kermitt2 avatar Oct 09 '22 10:10 kermitt2