reader
reader copied to clipboard
Consider using defusedxml
https://pypi.org/project/defusedxml/
Related issue: https://github.com/kurtmckee/feedparser/issues/107
We have two (obvious) options here:
- contribute to feedparser, see the issue linked above (I don't know if it's easy to do)
- pass the XML stream through defusedxml before passing it to feedparser
- this will likely remove feedparser's ability to deal with broken xml (if it still can do that)
This is one way of doing it:
feedparser.api.PREFERRED_XML_PARSERS.insert(0, 'defusedxml.expatreader')
Note that it's global; maybe we can make a full copy of the feedparser.api module at runtime, to avoid monkeypatching? Update: https://stackoverflow.com/a/11285504
An alternative is to (have the user) use defusedxml.defuse_stdlib()
(unsupported) by themselves and be done with it.
There's one of my feeds that fails when I try the above.
unexpected error while reading feed: 'http://www.xn--8ws00zhy3a.com/feed': defusedxml.common.EntitiesForbidden: EntitiesForbidden(name='xhtml', system_id=None, public_id=None)
The feed looks like this (note where &id;
is used; it's likely critical to actually expand it):
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE feed [
<!ENTITY xhtml "http://www.w3.org/1999/xhtml">
<!ENTITY id "tag:xn--8ws00zhy3a.com,">
]>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en-gb" xml:base="http://www.詹姆斯.com/feed">
...
<entry>
<title>HTML Kong</title>
<id>&id;2016-07-18:/blog/16</id>
<updated>2016-07-18T05:14:09+00:00</updated>
<summary type="xhtml">
<div xmlns="&xhtml;">
I think this shows the need to be able to whitelist feeds.
It would be nice if this were granular (just allow "xhtml" and "id"), but TBD that seems like a poor user experience. Also, if the feed becomes malicious, they could just change what the whitelisted entities mean. A single "trust this feed" is likely enough.
FWIW, using lxml may be good enough: https://pypi.org/project/defusedxml/#python-xml-libraries (most of the vulnerabilities are marked with False, and we may not care about those marked with True; needs looking into).
Here's roughly what we need to do to use lxml with feedparser:
import lxml.etree, lxml.sax
class XMLParser:
def __init__(self):
self.handler = None
def setFeature(self, *args):
# we need to support/assert at least these:
#setFeature('http://xml.org/sax/features/namespaces', 1)
#setFeature('http://xml.org/sax/features/external-general-entities', 0)
pass
def setContentHandler(self, handler):
assert not self.handler
self.handler = handler
def setErrorHandler(self, handler):
assert self.handler is handler
def parse(self, source):
parser = lxml.etree.XMLParser(recover=True)
tree = lxml.etree.parse(source.getByteStream(), parser)
return lxml.sax.saxify(tree, self.handler)
def create_parser(encoding=None):
return XMLParser()
import feedparser
feedparser.api.PREFERRED_XML_PARSERS.insert(0, '__main__')