grobid_client_python
Adjustments to grobid-client.py
Dear Patrice,
As we discussed in the other repo, I adjusted the client so that it can parse citations from plain text. The solution turned out a bit ugly, but it now works as follows:
- It reads a TXT file as input, with one citation per line.
- It groups citations into batches of a thousand (or the specified batch_size) and saves each batch to an XML file, named after the input file plus the batch's thousand (or batch_size multiple).
- At the end, it opens each file and adds the appropriate XML opening and closing.
- TXT and PDF file handling are separated after the common "process" function.
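The batching steps above can be sketched roughly as follows. This is only an illustration, not the code from my file: the function names, the output naming scheme (input stem plus the batch's starting offset) and the "citations" root element are all assumptions made for the example.

```python
import os
from itertools import islice


def read_citations(path):
    """Yield non-empty citation strings, one per line of a text file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line


def batches(items, batch_size=1000):
    """Group an iterable into lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk


def output_name(input_path, batch_index, batch_size=1000):
    """Hypothetical naming scheme: input stem plus the batch's starting offset."""
    stem = os.path.splitext(os.path.basename(input_path))[0]
    return f"{stem}_{batch_index * batch_size}.xml"


def wrap_xml(fragments, root="citations"):
    """Wrap per-citation XML fragments in a placeholder root element (assumed name)."""
    body = "\n".join(fragments)
    return f'<?xml version="1.0" encoding="UTF-8"?>\n<{root}>\n{body}\n</{root}>\n'
```

Reading lazily and batching with islice keeps memory use flat even for inputs with millions of lines, since only one batch is held at a time.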
Issues:
- I had to rename the 'input' variable to 'input2' because Python was complaining about the name.
- The input file must be a TXT file.
- If more than one worker is specified, the output no longer preserves the order of the input file.
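One possible way around the ordering issue, which I have not implemented in the file, would be to let the executor hand results back in submission order. The sketch below uses the standard library's ThreadPoolExecutor; process_batch is a hypothetical stand-in for the real call to the GROBID service, and the "citation" element it returns is a placeholder.

```python
from concurrent.futures import ThreadPoolExecutor


def process_batch(batch):
    """Hypothetical stand-in for sending one batch of citations to GROBID."""
    return [f"<citation>{text}</citation>" for text in batch]


def process_in_order(batch_list, workers=4):
    """Process batches concurrently; Executor.map yields results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(process_batch, batch_list))
```

Because Executor.map returns results in the order the inputs were submitted, the batches can still be written out sequentially even though they finish at different times.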
Examples:
If order matters (--n < 2):
python grobid-client.py --input /path/to/refs/file.txt --n 1
If order does not matter (--n > 1 or the default):
python grobid-client.py --input /path/to/refs/file.txt
Parsing 2 million citations with a single worker on a 2015 MacBook Pro took around 6 hours. Not so slow :)
Here is the file: https://github.com/darjusp/contribs/blob/master/grobid-client.py