ChemDataExtractor Unable to read in xml file

USpatenttest.xml.zip

I'm having trouble reading this XML file with the generic XMLReader. It was downloaded from the WIPO PATENTSCOPE site.

I run:

    from chemdataextractor import Document
    f = open('USpatenttest.xml', 'rb')
    doc = Document.from_file(f)

And I get the error:

    File "/home/ubuntu/miniconda3/envs/reverie_env/lib/python3.6/site-packages/chemdataextractor/reader/markup.py", line 208, in parse
      root = self._css(self.root_css, root)[0]
    IndexError: list index out of range

Any advice is greatly appreciated! Thanks

Sep 18 '20 04:09 sophiatabchouri

Each reader has a detect() method that decides whether it should be the one to parse a given file. My guess is that the US patent XML reader isn't recognizing your file. Note the comment left in its detect() method:

        if b'us-patent-grant' in fstring:
            return True
        # TODO: Other DTDs

So you probably have to subclass UsptoXmlReader, override detect() so that it accepts your file, and then pass an instance of that subclass in the readers parameter of Document.from_file().
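A minimal sketch of that idea, in case it helps. The detection logic is written as a small mixin so it can be tried without the library installed. MarkerDetectMixin, MyUsptoXmlReader and the b'us-patent-application' marker are made-up names and values for illustration; check which string actually occurs in your file. The intended use with ChemDataExtractor is shown in the trailing comments and is untested against your document.

```python
class MarkerDetectMixin:
    """Accept a file when any of `markers` occurs in its raw bytes."""

    # Hypothetical markers: replace them with a string that really occurs in your file.
    markers = (b'us-patent-grant', b'us-patent-application')

    def detect(self, fstring, fname=None):
        # Same shape as the readers' detect(): raw bytes plus an optional file name.
        if fname and not fname.lower().endswith('.xml'):
            return False
        return any(marker in fstring for marker in self.markers)


# Intended use (needs chemdataextractor, so it is left as comments here):
#
#   from chemdataextractor import Document
#   from chemdataextractor.reader import UsptoXmlReader
#
#   class MyUsptoXmlReader(MarkerDetectMixin, UsptoXmlReader):
#       pass
#
#   with open('USpatenttest.xml', 'rb') as f:
#       doc = Document.from_file(f, readers=[MyUsptoXmlReader()])
```

Listing the mixin first makes its detect() take precedence over the one in UsptoXmlReader. If the reader's root_css does not match the root element of your file, that attribute would probably need overriding in the subclass as well, otherwise the same IndexError could come back.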

Sep 20 '20 02:09 maddenfederico

I am trying to parse an XML file using the generic XMLReader and I am getting this error too. When I use the function lxml.etree.fromstring directly, it parses fine. My XML isn't a US patent, so I can't use the specific reader for this.

It seems that when I change the root_css query from html to :root, my document is parsed successfully.
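To illustrate why that change helps, here is a small standard-library analogue of what I think is happening. It does not use ChemDataExtractor or lxml; the sample XML and the first() helper are made up for the example. Looking for an html element in a document that has none gives an empty list, and taking [0] of it raises the same IndexError as in the traceback, while the document element itself always exists.

```python
import xml.etree.ElementTree as ET

# Made-up, non-HTML document standing in for a generic XML file.
xml_bytes = b'<article><title>Example</title></article>'
root = ET.fromstring(xml_bytes)

# Analogue of root_css = 'html': search for an <html> element at or below the root.
html_matches = list(root.iter('html'))

# Analogue of root_css = ':root': the document element itself.
root_matches = [root]


def first(matches):
    """Return the first match, as the `[0]` in the traceback does."""
    return matches[0]


print(len(html_matches))         # no <html> element, so first(html_matches) would raise IndexError
print(first(root_matches).tag)   # the document element is always available
```

With the library itself, the equivalent change would presumably be a subclass of the generic XML reader that sets root_css = ':root', with an instance of it passed through the readers parameter of Document.from_file().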

Sep 22 '20 13:09 lameturkey

We had a similar issue: valid .xml, but an IndexError on parsing. Inspired by the unit tests, we wrote a manual parser for PMC files using NlmXmlReader. For other formats you can swap in the reader that fits your use case (see http://chemdataextractor.org/docs/reading):

from chemdataextractor import Document
from chemdataextractor.reader import NlmXmlReader

def read_xml_file(fname: str) -> Document:
    """Read an XML file manually and return the parsed Document."""
    r = NlmXmlReader()
    # The reader works on bytes, so open the file in binary mode.
    with open(fname, 'rb') as body:
        content = body.read()

    return r.readstring(content)

fname = 'Your/Path/file.xml'
doc = read_xml_file(fname=fname)

Sep 21 '23 14:09 fmoorhof