grobid_client_python
Error 408 when consolidating citations
The following file can be processed by grobid when called with curl, but the equivalent (?) Python command fails with error 408.
curl call: curl -v --form input=@./meta-chinese.pdf --form consolidateCitations=1 localhost:8070/api/processReferences
python: client.process("processReferences", "./screened_PDF", consolidate_citations=True)
The server log does not show any obvious issues. The python command works when I don't consolidate citations.
Any ideas / suggestions?
Just to add - a basic requests.post call works from Python. I can't quite see what the client is doing differently ...
import requests

GROBID_URL = 'http://localhost:8070'
url = '%s/api/processReferences' % GROBID_URL
pdf = './screened_PDF/meta-chinese.pdf'

# post the PDF to processReferences with citation consolidation enabled
xml = requests.post(url, files={'input': open(pdf, 'rb')}, data={'consolidateCitations': '1'})
@LukasWallrich, the input_path should be a directory. Indeed, this is a bug, as the client should say something about it. Single files can be processed by calling process_pdf. I'm not sure if process_pdf is meant to be called like that, though.
Hello!
The purpose of this client is to process a directory of files, i.e. to do batch processing while managing concurrency efficiently. I tried to make this explicit in the readme and in the --help:
--input INPUT path to the directory containing PDF files or .txt
(for processCitationList only, one reference per line)
to process
--output OUTPUT path to the directory where to put the results
(optional)
If you want to process a single PDF file, you can use client.process_pdf(), but as Luca said, it's not written to be used like that outside a batch process, and all the arguments must be provided.
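For what it's worth, a direct process_pdf call might look roughly like the sketch below. The import path, the config_path constructor argument, and the keyword names are assumptions based on the client source at the time of writing, so check grobid_client/grobid_client.py for the exact signature in your version.

from grobid_client.grobid_client import GrobidClient

client = GrobidClient(config_path="./config.json")

# process_pdf expects every flag to be passed explicitly; the argument
# names below are assumptions, check the client source for your version
pdf_file, status, text = client.process_pdf(
    "processReferences",
    "./screened_PDF/meta-chinese.pdf",
    generateIDs=False,
    consolidate_header=False,
    consolidate_citations=True,
    include_raw_citations=False,
    include_raw_affiliations=False,
    tei_coordinates=False,
    segment_sentences=False,
)

if status == 200:
    print(text)  # TEI XML with the consolidated references
else:
    print("GROBID returned status", status)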
Thank you both! The input here is a folder with two files - the other one works fine. So that does not seem to be the issue.
If it's a 408 timeout, it might simply be that the Crossref API is too slow to consolidate the citations. But for 2 files, that would mean the Crossref API is very, very slow. You can improve the response time a bit by indicating your email in the Grobid config file (the "polite" usage): https://grobid.readthedocs.io/en/latest/Consolidation/#crossref-rest-api
However, when it is not in good shape, the Crossref API sometimes takes several seconds to answer each request. With many references, the timeout (60 seconds) might be reached. Even with a Plus token, this can happen.
For production, it's not really possible to use the Crossref web API, which is why biblio-glutton was developed.
Thanks. Adding the email is a bit difficult as I am on an M2 mac and can thus only run grobid in the Docker container, which is hard to edit. Anyway, the request through the client fails even when there is only one PDF in the folder, while the manual Python request works. Also, the server log shows that crossref requests go through every second or so ... so there might be something more specific going on.
For my use case, I only need to process a couple of hundred PDFs, so I can go down the more manual route, but obviously, the client would be helpful ...
Adding the email is a bit difficult as I am on an M2 mac and can thus only run grobid in the Docker container, which is hard to edit.
You don't need to edit the container: simply edit the config file and mount it when launching the container, like this:
docker run --rm --gpus all -p 8080:8070 -p 8081:8071 -v /home/lopez/grobid/grobid-home/config/grobid.yaml:/opt/grobid/grobid-home/config/grobid.yaml:ro grobid/grobid:0.7.2-SNAPSHOT
(where /home/lopez/grobid/grobid-home/config/grobid.yaml is your edited local config file with your email for Crossref politeness)
the server log shows that crossref requests go through every second or so ... so there might be something more specific going on.
This is probably too slow... A good rate is at least 10 consolidated citations per second, to avoid painful slowness and timeouts when parallelizing the processing. If it's just a few hundred PDFs, you can try the public biblio-glutton (which synchronizes itself daily with Crossref) with a low concurrency to avoid putting too heavy a load on this cheap server :D
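On the client side, a minimal sketch of a low-concurrency batch run could look like the following; the output and n keyword names are assumptions based on the client source and --help at the time of writing, so double-check them for your version:

from grobid_client.grobid_client import GrobidClient

client = GrobidClient(config_path="./config.json")

# batch-process the whole directory with citation consolidation, keeping
# concurrency low ('n' is assumed to be the number of concurrent workers,
# as in the --n command line option) to spare the consolidation service
client.process(
    "processReferences",
    "./screened_PDF",
    output="./screened_TEI",
    n=2,
    consolidate_citations=True,
)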