reader treats all bozo feeds as errors
reader treats all bozo feeds as errors, even if the loose parser managed to parse them:
<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>title</title>
<updated>2021-12-18T11:00:00</updated>
<id>http://example.com/</id>
<entry>
<id>http://example.com/entry</id>
<updated>2021-07-29T00:00:00</updated>
<content type="html">
&apos; &amp; &gt; &ldquo; &lt; &quot; &rdquo; &rsquo;
</content>
</entry>
</feed>
{
'bozo': 1,
'bozo_exception': SAXParseException('undefined entity'),
'encoding': 'utf-8',
'entries': [
{
'content': [
{
'base': '',
'language': None,
'type': 'text/html',
'value': '\' & > “ < " ” ’',
}
],
'id': 'http://example.com/entry',
'summary': '\' & > “ < " ” ’',
...
}
],
'feed': {
'id': 'http://example.com/',
'title': 'title',
...
},
'headers': {},
'namespaces': {'': 'http://www.w3.org/2005/Atom'},
'version': 'atom10',
}
We still need a heuristic to tell a feed like this apart from complete garbage (a non-empty version, and the presence of entries?):
>>> feedparser.parse("garbage")
{'bozo': 1, 'entries': [], 'feed': {}, 'headers': {}, 'encoding': 'utf-8', 'version': '', 'bozo_exception': SAXParseException('syntax error'), 'namespaces': {}}
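A rough sketch of what that heuristic could look like, assuming the two signals floated above (a detected version and at least one entry) are enough. The name `looks_salvageable` is made up for illustration, and the function only relies on the result behaving like a mapping with the keys shown above.

```python
def looks_salvageable(result):
    """Guess whether a feedparser-style result holds real feed data.

    `result` is any mapping with 'bozo', 'version' and 'entries' keys
    (hypothetical helper, not part of feedparser or reader).
    """
    if not result.get('bozo'):
        # The strict parser succeeded; nothing to second-guess.
        return True
    # The loose parser was used: require a detected format and some entries.
    return bool(result.get('version')) and bool(result.get('entries'))


garbage = {'bozo': 1, 'entries': [], 'feed': {}, 'version': ''}
broken_but_parsed = {
    'bozo': 1,
    'version': 'atom10',
    'entries': [{'id': 'http://example.com/entry'}],
}

print(looks_salvageable(garbage))            # False
print(looks_salvageable(broken_but_parsed))  # True
```

A bozo feed that is valid but has no entries yet would be rejected by this check, which may or may not be acceptable.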
Some conclusions from playing with the Atom feed below:
- xml.sax.SAXParseException "undefined entity" is survivable.
- "mismatched tag" is not; we get all the good entries, and then the broken entry in a bad state (e.g. with all of its content in a single element); the entries after it are usually missing, but not always.
- It may be worth finding out what other kinds of errors can be encountered... (all of them).
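To make the conclusions above concrete, here is a minimal sketch that classifies a `bozo_exception` by its message. It assumes the message text is a reliable signal, which is untested beyond the two cases listed; `SURVIVABLE_MESSAGES` and `is_survivable` are hypothetical names, and anything not explicitly listed is treated as fatal.

```python
from xml.sax import SAXParseException

# Assumption: only messages observed to be harmless so far go in here;
# every other error (including "mismatched tag") is treated as fatal.
SURVIVABLE_MESSAGES = frozenset({'undefined entity'})


def is_survivable(exc):
    """Return True if the loose parser's output is likely still complete."""
    if not isinstance(exc, SAXParseException):
        return False
    # getMessage() returns the bare message, without the line/column prefix.
    return exc.getMessage() in SURVIVABLE_MESSAGES
```

With feedparser this would be called as `is_survivable(result['bozo_exception'])`. Defaulting to "not survivable" keeps unknown error kinds on the current behavior until they have been looked at.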
Also, when the loose parser is used, the feed should be considered stale; that is, we should always prefer entries from the non-broken feed.
I'm thinking of something like this:
| existing | parsed | desired behavior | current behavior |
|---|---|---|---|
| none | any | use new (any) | yes |
| any | strict | use new (strict) | yes (hash takes care of it) |
| strict | loose | keep old (strict) | no (different hash => update) |
| loose | loose | use new (loose) | yes (hash takes care of it) |
This would favor feeds that are temporarily broken and eventually get fixed. For feeds that become permanently broken, it means that entries last parsed in strict mode would never receive updates.
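The "desired behavior" column of the table can be written down as a small decision function. This is only a sketch: `Mode` and `should_use_new` are hypothetical names, and it assumes we store which parser produced the existing entry (`None` meaning there is no existing entry).

```python
from enum import Enum


class Mode(Enum):
    STRICT = 'strict'
    LOOSE = 'loose'


def should_use_new(existing, parsed):
    """Decide whether the freshly parsed entry replaces the stored one.

    existing: Mode of the stored entry, or None if there is none.
    parsed:   Mode of the parser that produced the new entry.
    """
    if existing is None:
        # none / any -> use new
        return True
    if parsed is Mode.STRICT:
        # any / strict -> use new
        return True
    # parsed is loose: only overwrite data that was itself loose,
    # so strict / loose keeps the old entry.
    return existing is Mode.LOOSE


print(should_use_new(Mode.STRICT, Mode.LOOSE))  # False
print(should_use_new(Mode.LOOSE, Mode.LOOSE))   # True
```

In the rows where this returns True, the existing hash comparison would still decide whether anything actually changed.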
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
<id>one</id>
<title>1</title>
<summary>i</summary>
</entry>
<entry>
<id>two</id>
<title>Atom-Powered Robots Run Amok
<summary>Summary.&veryundefinedentity;
<content>Content.</content>
</entry>
<entry>
<id>three</id>
<title>3</title>
<summary>iii</summary>
</entry>
</feed>