grobid_client_python

Adjustments to grobid-client.py

Open darjusp opened this issue 4 years ago • 0 comments

Dear Patrice,

As we discussed in the other repo, I adjusted the client so that it can parse citations from text. The solution turned out a bit ugly, but it now does the following:

  1. It reads a TXT file as input, with one citation per line.
  2. It groups citations into batches of one thousand (or the specified batch_size) and saves each batch to an XML file named after the input file plus the batch's thousand (or batch_size multiple).
  3. At the end, it opens each file and adds the appropriate XML beginning and end.
  4. TXT and PDF file handling are separated after the common function "process".
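The batching and file naming described in the steps above could be sketched roughly as follows. This is a minimal illustration, not the code from the linked file: the helper names (`read_citations`, `batched`, `batch_output_path`, `write_batch`), the `_2000.xml` naming pattern, and the `<citations>` root element are all assumptions made for the example. It also writes the XML beginning and end in one pass instead of reopening each file afterwards.

```python
from itertools import islice
from pathlib import Path


def read_citations(path):
    """Yield non-empty citation strings, one per line of the TXT input."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line


def batched(items, batch_size=1000):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def batch_output_path(input_path, batch_index, batch_size=1000):
    """Name the output after the input plus the batch's upper bound (assumed pattern)."""
    source = Path(input_path)
    return source.with_name(f"{source.stem}_{(batch_index + 1) * batch_size}.xml")


def write_batch(output_path, parsed_fragments):
    """Write one batch wrapped in a placeholder root element (hypothetical, not GROBID's schema)."""
    with open(output_path, "w", encoding="utf-8") as handle:
        handle.write('<?xml version="1.0" encoding="UTF-8"?>\n<citations>\n')
        for fragment in parsed_fragments:
            handle.write(fragment + "\n")
        handle.write("</citations>\n")
```

Reading lazily with a generator keeps memory use flat even for inputs with millions of citations, since only one batch is held at a time.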

Issues:

  1. I had to rename the 'input' variable to 'input2' because Python complained about the name.
  2. The input file must be a TXT file.
  3. If more than one worker is specified, the output no longer follows the order of the input file.
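One possible way to keep the order with several workers, assuming batches are dispatched through a thread pool (I have not checked how the client schedules its work), is to rely on `Executor.map` from the standard library, which returns results in the order of its inputs regardless of which worker finishes first. `parse_batch` below is a hypothetical stand-in for the actual call to the GROBID service.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_batch(batch):
    """Hypothetical stand-in for sending one batch of citations to GROBID."""
    return [f"<parsed>{citation}</parsed>" for citation in batch]


def process_in_order(batches, workers=4):
    """Run parse_batch concurrently; Executor.map yields results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_batch, batches))


if __name__ == "__main__":
    results = process_in_order([["ref 1", "ref 2"], ["ref 3"]], workers=2)
    for batch_index, parsed in enumerate(results):
        print(batch_index, parsed)
```

Because the results come back indexed by batch position, each one can be written to the output file that corresponds to its batch, so the files stay aligned with the input.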

Examples:

If order matters (--n 1):
  python grobid-client.py --input /path/to/refs/file.txt --n 1

If order does not matter (--n greater than 1, or the default):
  python grobid-client.py --input /path/to/refs/file.txt

Parsing 2 million citations with a single worker on a 2015 MacBook Pro took around 6 hours. Not so slow :)

Here is the file: https://github.com/darjusp/contribs/blob/master/grobid-client.py

darjusp, Aug 12 '19 06:08