
Created a download argument for running the client

Open: amensiko opened this issue 3 years ago • 1 comment

Hello!

My colleagues and I at NASA Jet Propulsion Laboratory used the Grobid Python Client for an internal project and found it extremely useful for parsing scientific papers and extracting information from them. Grobid is certainly one of the most incredible parsing tools out there and it has helped us tremendously, so thank you so much for all your work!

Something we really wanted from the client was the ability to parse PDFs without downloading the output XMLs locally. I didn't see that as an option/argument for the client, so I created it and added it to the code. In short, passing the --download flag as False saves the output in a cache represented by a list of tuples, where each tuple represents one file and contains the filename, the path, and the XML output as a string. Later on, the cache (client.cache) can be used for further parsing if need be (see an example in test-cache.py). Passing the --download flag as True saves the XML files locally, as the client did before my modifications.
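To make the cache shape concrete, here is a rough, self-contained sketch of the behaviour described above. It is not the code from the PR: handle_result, the output_dir parameter, and the .tei.xml output name are hypothetical, and I am assuming the "path" stored in each tuple is the path of the input PDF.

```python
import os
from typing import List, Tuple

# One cache entry per processed file: (filename, path, XML output as a string).
CacheEntry = Tuple[str, str, str]


def handle_result(pdf_path: str, xml: str, download: bool,
                  cache: List[CacheEntry], output_dir: str = ".") -> None:
    """Hypothetical helper: write the XML to disk or keep it in memory."""
    filename = os.path.basename(pdf_path)
    if download:
        # download=True: write the XML next to the other outputs.
        # The ".tei.xml" suffix is an assumption made for this sketch.
        out_name = os.path.splitext(filename)[0] + ".tei.xml"
        out_path = os.path.join(output_dir, out_name)
        with open(out_path, "w", encoding="utf-8") as f:
            f.write(xml)
    else:
        # download=False: nothing is written; the result is appended to the cache.
        cache.append((filename, pdf_path, xml))


# Placeholder XML strings stand in for real Grobid output.
cache: List[CacheEntry] = []
handle_result("papers/a.pdf", "<TEI>a</TEI>", download=False, cache=cache)
handle_result("papers/b.pdf", "<TEI>b</TEI>", download=False, cache=cache)

# Later parsing can iterate over the cached tuples.
for filename, path, xml in cache:
    print(filename, path, len(xml))
```

Every entry stays in memory until the cache is cleared, so the list grows with the number of PDFs processed.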

I wanted to share my modifications in case they could be of use to others. Please let me know if you have any questions or concerns!

Anastasija

amensiko, Dec 17 '20 23:12

Hi @amensiko!

Thanks a lot for the nice words on Grobid and for the PR!

If I understand correctly, the download option you introduce is actually a "write" option. The XML result is always downloaded, but you would like it not to be written to a file (as it is by default in process_pdf()) but kept in a str variable, with all these XML strings accumulated in an array in the client itself.

The issue with a cache maintained in the client itself is that it will blow up memory as soon as many PDFs are processed, and processing many PDFs is the purpose of this client (otherwise we would need a disk DB for the cache).

If I understand your use case correctly, maybe you would like the XML written to a stream passed to the client instead of to the file system? I guess we could use the StringIO class from Python's io standard library to pass a stream to the client, as an alternative to the default file system. When I wrote this client, it was more an example of using the Grobid API in a concurrent manner, to be adapted depending on the use case (writing to a DB, to a stream, etc.), but this would be an opportunity to think about a more generic/complete/packaged client.
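A minimal sketch of the stream idea, only to illustrate the shape it could take. Nothing here exists in the client today: process_to_stream and fake_extract are hypothetical names, and fake_extract merely stands in for the actual call to the Grobid service.

```python
import io
from typing import Callable, TextIO


def process_to_stream(pdf_path: str, sink: TextIO,
                      extract: Callable[[str], str]) -> None:
    """Write the XML for one PDF to whatever text stream the caller supplies."""
    sink.write(extract(pdf_path))


def fake_extract(pdf_path: str) -> str:
    # Placeholder for the real request to the Grobid service (assumption).
    return f"<TEI source='{pdf_path}'/>"


# In-memory stream: the result never touches the file system.
buffer = io.StringIO()
process_to_stream("paper.pdf", buffer, fake_extract)
xml = buffer.getvalue()
print(xml)
```

Because an object returned by open(..., "w") is also a text stream, the same function could keep the current write-to-file behaviour, and the caller decides how long each result stays in memory.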

kermitt2, Dec 20 '20 06:12