Unable to read in XML file
I'm having trouble reading in this XML file with the generic XmlReader. It was downloaded from the WIPO PATENTSCOPE site.
I run:
from chemdataextractor import Document
f = open('USpatenttest.xml', 'rb')
doc = Document.from_file(f)
And I get the error:
File "/home/ubuntu/miniconda3/envs/reverie_env/lib/python3.6/site-packages/chemdataextractor/reader/markup.py", line 208, in parse
    root = self._css(self.root_css, root)[0]
IndexError: list index out of range
Any advice is greatly appreciated! Thanks
Each reader has a detect() method that decides whether it should be the one to parse a given file. My guess is that the US patent XML reader isn't recognizing your file. Note the comment left in its detect() method:
if b'us-patent-grant' in fstring:
    return True
# TODO: Other DTDs
So you probably have to make a subclass of UsptoXmlReader, override its detect() method to accept your file, and then pass an instance of that subclass in the readers parameter of Document.from_file().
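A minimal sketch of that suggestion follows; it is not a tested fix for PATENTSCOPE downloads. The names ACCEPTED_MARKERS, is_accepted_patent_xml, make_reader, load_patent and PatentscopeXmlReader are hypothetical, and b'us-patent-application' is only an assumed example of a second marker, so replace it with whatever your file actually contains. It also assumes that readers takes a list of reader instances. If the file's root element differs from what the USPTO reader expects, the root lookup could presumably still fail, so treat this as a starting point.

```python
# Byte markers to accept. b'us-patent-grant' is the one the stock detect()
# checks; b'us-patent-application' is an assumed example, not a confirmed
# PATENTSCOPE marker. Replace it with a string that appears in your file.
ACCEPTED_MARKERS = (b'us-patent-grant', b'us-patent-application')


def is_accepted_patent_xml(fstring, fname=None):
    """Return True if the raw bytes look like a patent XML file we want to parse."""
    if fname and not fname.endswith(('.xml', '.nxml')):
        return False
    return any(marker in fstring for marker in ACCEPTED_MARKERS)


def make_reader():
    """Build the reader; the import is local so the check above works on its own."""
    from chemdataextractor.reader import UsptoXmlReader

    class PatentscopeXmlReader(UsptoXmlReader):  # hypothetical subclass name
        def detect(self, fstring, fname=None):
            return is_accepted_patent_xml(fstring, fname)

    return PatentscopeXmlReader()


def load_patent(path):
    """Parse the file at `path` using only the custom reader."""
    from chemdataextractor import Document

    with open(path, 'rb') as f:
        # Assumption: `readers` expects a list of reader instances.
        return Document.from_file(f, readers=[make_reader()])
```

With chemdataextractor installed, usage would be something like doc = load_patent('USpatenttest.xml').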
I am trying to parse an XML file using the generic XmlReader and I am getting this error too. When I call lxml.etree.fromstring directly, the file parses fine. My XML isn't a US patent, so I can't use the patent-specific reader for it.
It seems that when I change the root_css query from html to :root, my document parses successfully.
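To illustrate why that change helps, here is a small standard-library analogy. It does not use chemdataextractor or its CSS machinery; select_root and the patent-document element are hypothetical stand-ins. Looking up the root by a fixed tag name and taking element [0] raises IndexError when no element has that tag, whereas taking the document root directly works for any well-formed XML.

```python
import xml.etree.ElementTree as ET


def select_root(xml_bytes, root_tag=None):
    """Mimic a reader's root lookup.

    root_tag='html' behaves like the `html` query: collect elements with
    that tag name and take the first one.
    root_tag=None behaves like `:root`: take the document's root element,
    whatever it is called.
    """
    root = ET.fromstring(xml_bytes)
    if root_tag is None:
        return root
    # iter() yields the root itself plus any descendants with this tag
    matches = list(root.iter(root_tag))
    return matches[0]  # IndexError when nothing matches


# Hypothetical non-HTML document
sample = b'<patent-document><title>Example</title></patent-document>'

try:
    select_root(sample, 'html')
except IndexError as exc:
    print('html query failed:', exc)

print(select_root(sample).tag)  # prints "patent-document"
```

In chemdataextractor itself, the equivalent would presumably be a reader subclass that sets root_css to ':root', as described above.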
We had a similar issue: valid XML, but an IndexError on parsing. Inspired by the unit tests, we wrote a manual parser for PMC files using NlmXmlReader. For other formats, you can swap in the reader that fits your use case; see http://chemdataextractor.org/docs/reading:
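import io
import os
from chemdataextractor import Document
from chemdataextractor.reader import NlmXmlReader

def read_xml_file(fname: str) -> Document:
    """Read an XML file manually with NlmXmlReader."""
    reader = NlmXmlReader()
    # Resolve the path relative to this script, as in the unit tests
    path = os.path.join(os.path.dirname(__file__), fname)
    # Read the raw bytes and hand them straight to the reader
    with io.open(path, 'rb') as body:
        content = body.read()
    return reader.readstring(content)

fname = 'Your/Path/file.xml'
doc = read_xml_file(fname=fname)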