Self-promotion: new `grobid_tei_xml` python library
Wanted to share this new python library for parsing metadata out of GROBID "flavor" TEI-XML:
- https://gitlab.com/internetarchive/grobid_tei_xml
- https://pypi.org/project/grobid-tei-xml/
As mentioned in the README, there are a couple of other libraries that do similar or the same thing, including generic TEI parsing libraries which are not specific to GROBID. At scholar.archive.org we needed to extract header and citation metadata in a structured but non-XML format (e.g., JSON or Python objects), so we wrote this. It uses only the Python 3 standard library, includes type annotations, and has decent test coverage. It supports both older ~v0.5 era GROBID documents and more recent output. We have run tens of millions of PDFs through GROBID and parsed the resulting output with this code.
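To give a rough idea of what "structured but non-XML" means here, below is a minimal standard-library sketch that pulls a title and author names out of a TEI header and returns a JSON-friendly dict. It is not the `grobid_tei_xml` API or its implementation: the function name `extract_header`, the output keys, and the hand-written `SAMPLE_TEI` fragment are all assumptions for illustration, and real GROBID output is much larger and messier. See the README for the library's actual interface.

```python
import json
import xml.etree.ElementTree as ET

# TEI namespace used by TEI-XML documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written, simplified fragment (an assumption for illustration only).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">An Example Paper</title>
      </titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author>
              <persName><forename type="first">Ada</forename><surname>Lovelace</surname></persName>
            </author>
            <author>
              <persName><forename type="first">Alan</forename><surname>Turing</surname></persName>
            </author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def extract_header(tei_xml: str) -> dict:
    """Hypothetical helper: return the title and author names as plain Python data."""
    root = ET.fromstring(tei_xml)

    # The main title sits under titleStmt in the TEI header.
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = None
    if title_el is not None and title_el.text:
        title = title_el.text.strip()

    # Each author has a persName with optional forename and surname parts.
    authors = []
    for pers in root.findall(".//tei:analytic/tei:author/tei:persName", TEI_NS):
        forename = pers.findtext("tei:forename", default="", namespaces=TEI_NS)
        surname = pers.findtext("tei:surname", default="", namespaces=TEI_NS)
        name = " ".join(part.strip() for part in (forename, surname) if part.strip())
        if name:
            authors.append(name)

    return {"title": title, "authors": authors}


if __name__ == "__main__":
    # The result is a plain dict, so it serializes directly to JSON.
    print(json.dumps(extract_header(SAMPLE_TEI), indent=2))
```

The point of the sketch is only the shape of the transformation: namespaced XML in, plain dicts and lists out, which can then be dumped as JSON or wrapped in typed objects.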