grobid_client_python: Adjustments to grobid-client.py

Open darjusp opened this issue 5 years ago • 0 comments

Dear Patrice,

As we discussed in the other repo, I adjusted the client so that it can parse citations from text. The solution turned out a bit ugly, but it now does the following:

It reads a "txt" file as input, with one citation per line.
It groups citations into batches of one thousand (or the specified batch_size) and saves each batch in an XML file, named after the input file plus the batch's thousand mark (or batch_size multiple).
At the end, it opens each file and adds the appropriate XML beginning and end.
TXT and PDF file handling are separated after the common function "process".
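
A rough sketch of the batching idea described above, for illustration only. It is not the code from the linked file: the function names (batches, write_batches), the process_citation callback, the output naming scheme and the <citations> wrapper element are all assumptions. For brevity it writes the XML beginning and end while creating each file, instead of adding them in a final pass.

```python
import os
from itertools import islice


def batches(lines, batch_size=1000):
    """Yield lists of at most batch_size non-empty citation strings."""
    stripped = (line.strip() for line in lines)
    non_empty = (line for line in stripped if line)
    while True:
        chunk = list(islice(non_empty, batch_size))
        if not chunk:
            return
        yield chunk


def write_batches(input_path, process_citation, batch_size=1000):
    """Write one XML file per batch and return the list of file names.

    process_citation is a hypothetical callback that turns one citation
    string into an XML fragment (for example by calling a GROBID service).
    """
    base, _ = os.path.splitext(input_path)
    written = []
    with open(input_path, encoding="utf-8") as src:
        for index, chunk in enumerate(batches(src, batch_size), start=1):
            # Assumed naming scheme: input name plus the running batch mark.
            out_path = f"{base}_{index * batch_size}.xml"
            with open(out_path, "w", encoding="utf-8") as out:
                # Assumed wrapper element; the real client may use another root.
                out.write('<?xml version="1.0" encoding="UTF-8"?>\n<citations>\n')
                for citation in chunk:
                    out.write(process_citation(citation) + "\n")
                out.write("</citations>\n")
            written.append(out_path)
    return written
```

With batch_size=2 and an input file refs.txt holding five citations, this would produce refs_2.xml, refs_4.xml and refs_6.xml, the last one containing a single citation.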

Issues:

I had to rename the 'input' variable to 'input2' because Python was complaining about the name.
The input file must be a TXT file.
If more than one worker is specified, the output no longer keeps the order of the input file.
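
One possible way around the ordering issue, sketched here as an assumption and not taken from the linked file: Executor.map from Python's concurrent.futures returns results in the order of its input, regardless of which worker finishes first. The names process_in_order and process_citation are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def process_in_order(citations, process_citation, workers=4):
    """Process citations concurrently but return results in input order.

    process_citation is a hypothetical callable that handles one citation
    string, for example by sending it to a GROBID service.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the order of the input iterable,
        # even when later items finish before earlier ones.
        return list(pool.map(process_citation, citations))
```

The trade-off is that all results for a batch are held in memory until they are written, so this fits best when applied per batch and not to the whole input at once.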

Examples:

If order matters (--n 1):
python grobid-client.py --input /path/to/refs/file.txt --n 1

If order does not matter (--n > 1 or the default):
python grobid-client.py --input /path/to/refs/file.txt

Parsing 2 million citations with a single worker on a 2015 MacBook Pro took around 6 hours. Not so slow :)

Here is the file: https://github.com/darjusp/contribs/blob/master/grobid-client.py

Aug 12 '19 06:08 darjusp
