reader Consider using defusedxml

Consider using defusedxml

Open lemon24 opened this issue 3 years ago • 3 comments

https://pypi.org/project/defusedxml/

Related issue: https://github.com/kurtmckee/feedparser/issues/107

We have two (obvious) options here:

contribute to feedparser, see the issue linked above (I don't know if it's easy to do)
pass the XML stream through defusedxml before passing it to feedparser
- this will likely remove feedparser's ability to deal with broken xml (if it still can do that)

Jan 24 '21 09:01 lemon24

This is one way of doing it:

feedparser.api.PREFERRED_XML_PARSERS.insert(0, 'defusedxml.expatreader')

Note that it's global; maybe we can make a full copy of the feedparser.api module at runtime, to avoid monkeypatching? Update: https://stackoverflow.com/a/11285504

An alternative is to (have the user) use defusedxml.defuse_stdlib() (unsupported) by themselves and be done with it.

Jun 02 '21 14:06 lemon24

There's one of my feeds that fails when I try the above.

unexpected error while reading feed: 'http://www.xn--8ws00zhy3a.com/feed': defusedxml.common.EntitiesForbidden: EntitiesForbidden(name='xhtml', system_id=None, public_id=None)

The feed looks like this (note where &id; is used; it's likely critical to actually expand it):

<?xml version="1.0" encoding="utf-8"?>

<!DOCTYPE feed [
  <!ENTITY xhtml "http://www.w3.org/1999/xhtml">
  <!ENTITY id "tag:xn--8ws00zhy3a.com,">
]>

<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-gb" xml:base="http://www.詹姆斯.com/feed">

...

  <entry>
    <title>HTML Kong</title>
    <id>&id;2016-07-18:/blog/16</id>
    <updated>2016-07-18T05:14:09+00:00</updated>
    <summary type="xhtml">
      <div xmlns="&xhtml;">

I think this shows the need to be able to whitelist feeds.

It would be nice if this were granular (just allow "xhtml" and "id"), but TBD that seems like a poor user experience. Also, if the feed becomes malicious, they could just change what the whitelisted entities mean. A single "trust this feed" is likely enough.

Jul 25 '21 09:07 lemon24

FWIW, using lxml may be good enough: https://pypi.org/project/defusedxml/#python-xml-libraries (most of the vulnerabilities are marked with False, and we may not care about those marked with True; needs looking into).

Here's roughly what we need to do to use lxml with feedparser:

import lxml.etree, lxml.sax

class XMLParser:
    def __init__(self):
        self.handler = None
    def setFeature(self, *args):
        # we need to support/assert at least these:
        #setFeature('http://xml.org/sax/features/namespaces', 1)
        #setFeature('http://xml.org/sax/features/external-general-entities', 0)
        pass
    def setContentHandler(self, handler):
        assert not self.handler
        self.handler = handler
    def setErrorHandler(self, handler):
        assert self.handler is handler
    def parse(self, source):
        parser = lxml.etree.XMLParser(recover=True)
        tree = lxml.etree.parse(source.getByteStream(), parser)
        return lxml.sax.saxify(tree, self.handler)
            
    
def create_parser(encoding=None):
    return XMLParser()

import feedparser

feedparser.api.PREFERRED_XML_PARSERS.insert(0, '__main__')

Nov 01 '21 17:11 lemon24

reader reader copied to clipboard

Consider using defusedxml

reader
reader copied to clipboard