grobid_client_python

Passing files directly into Grobid without downloading

Open · matthieu-perso opened this issue 1 year ago · 1 comment

Hey Grobid team,

Thanks again for these incredible tools. I've been testing the Python client and ran into an issue when passing a single PDF as the input, both from the CLI and from Python: I didn't receive any output.

Sample code below

grobid_client --input ./resource/my.PDF --output ./out processFulltextDocument

While debugging, I realized that L122 of the grobid_client.py file expects a directory rather than the file itself, as in the request below.

grobid_client --input ./resource/mypdfdir --output ./out processFulltextDocument

On GCP, I was trying to pass files directly into Grobid without downloading them first, which the current setup would require. Is there any way to stream PDFs into Grobid, or to send them as file objects? If not, I'll see if I can pull something together quickly and test it.

matthieu-perso, Aug 09 '22 12:08

Hi @MatthieuMoullecDev !

This client does indeed take a directory as input/output, as documented, because it is aimed at batch processing of many files.

For me this client is a basis that can be adapted to different usage scenarios, so I tried to keep it simple, with zero external dependencies. You can use the client as a package and then call process_batch() or process_pdf(), whichever is more convenient for your set of files and pipeline.
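If you only need to send a single file object, one option is to bypass the client and post the bytes to the Grobid REST service yourself. Below is a minimal sketch using only the standard library, in keeping with the zero-dependency approach. The server address (localhost:8070) and the helper names build_multipart and process_pdf_stream are assumptions for illustration; they are not part of this client. The PDF goes in the multipart form field named input, which is the field the Grobid service reads.

```python
import urllib.request
import uuid

# Assumption: a Grobid server reachable at its usual local address.
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"


def build_multipart(pdf_bytes, filename="document.pdf", boundary=None):
    """Encode PDF bytes as a multipart/form-data body with one 'input' field.

    Hypothetical helper: returns (body, content_type_header_value).
    """
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="input"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("ascii")
    tail = f"\r\n--{boundary}--\r\n".encode("ascii")
    return head + pdf_bytes + tail, f"multipart/form-data; boundary={boundary}"


def process_pdf_stream(stream, url=GROBID_URL, timeout=120):
    """Read a binary file-like object fully and post it to Grobid.

    Hypothetical helper: returns the response body (TEI XML) as text.
    The whole PDF is held in memory, so nothing is written to local disk.
    """
    body, content_type = build_multipart(stream.read())
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Any binary file-like object works as the stream, for example io.BytesIO(blob_bytes) or a handle returned by a cloud storage library. The PDF is never written to local disk, but it is still read and uploaded in full before processing starts.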

You can probably start sending a file to the Grobid server while it is still downloading, but Grobid will only start processing a file once it is entirely uploaded (for stability/robustness and technical reasons). So the easiest approach for your scenario is probably to download a file, add it to an executor, and then delete the file when the result is ready.

From my experience, if no citation consolidation is used, Grobid processes a file faster than it takes to download a typical Unpaywall file.
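A rough sketch of that download, process, delete loop is shown here, assuming two callables that you supply yourself: fetch(source), which returns the PDF bytes for a URL or bucket key, and process(path), which sends the file at path to Grobid and returns the result. Both names, as well as process_remote_pdf and run_pipeline, are hypothetical and not part of this client.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def process_remote_pdf(source, fetch, process):
    """Download one PDF to a temp file, process it, then delete the file.

    fetch and process are caller-supplied (hypothetical) callables:
    fetch(source) -> bytes, process(path) -> result.
    """
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(fetch(source))
        return process(path)
    finally:
        # Runs whether processing succeeded or failed, so no PDF is left behind.
        os.remove(path)


def run_pipeline(sources, fetch, process, workers=4):
    """Run download + processing for each source on a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            source: pool.submit(process_remote_pdf, source, fetch, process)
            for source in sources
        }
        # result() re-raises any exception from the worker thread.
        return {source: future.result() for source, future in futures.items()}
```

Each worker holds at most one temporary file at a time, so disk usage stays bounded by the number of workers rather than by the size of the collection.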

kermitt2, Aug 11 '22 04:08