reader reader treats all bozo feeds as errors

Open lemon24 opened this issue 3 years ago • 1 comments

reader treats all bozo feeds as errors, even if the loose parser managed to parse them:

<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>title</title>
  <updated>2021-12-18T11:00:00</updated>
  <id>http://example.com/</id>
  <entry>
    <id>http://example.com/entry</id>
    <updated>2021-07-29T00:00:00</updated>
    <content type="html">
        &#39; &amp; &gt; &ldquo; &lt; &quot; &rdquo; &rsquo;
    </content>
  </entry>
</feed>

{
    'bozo': 1,
    'bozo_exception': SAXParseException('undefined entity'),
    'encoding': 'utf-8',
    'entries': [
        {
            'content': [
                {
                    'base': '',
                    'language': None,
                    'type': 'text/html',
                    'value': '\' & > “ < " ” ’',
                }
            ],
            'id': 'http://example.com/entry',
            'summary': '\' & > “ < " ” ’',
            ...
        }
    ],
    'feed': {
        'id': 'http://example.com/',
        'title': 'title',
        ...
    },
    'headers': {},
    'namespaces': {'': 'http://www.w3.org/2005/Atom'},
    'version': 'atom10',
}

We still need a heuristic to tell that apart from complete garbage (version, and the presence of entries?):

>>> feedparser.parse("garbage")
{'bozo': 1, 'entries': [], 'feed': {}, 'headers': {}, 'encoding': 'utf-8', 'version': '', 'bozo_exception': SAXParseException('syntax error'), 'namespaces': {}}
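One possible shape for that heuristic, as a minimal sketch only: treat a bozo result as usable when feedparser still detected a feed version and produced at least one entry. The function name `looks_salvageable` is hypothetical (not part of reader or feedparser), and the rule itself is an assumption taken from the question above; note that it would also reject a valid but empty feed that happened to go through the loose parser.

```python
def looks_salvageable(result):
    """Guess whether a feedparser result is worth keeping.

    Hypothetical helper, not an existing reader/feedparser API.
    `result` is the dict-like object returned by feedparser.parse().
    """
    if not result.get("bozo"):
        # No parse problem was flagged, nothing to second-guess.
        return True
    # Assumption: a detected version plus at least one entry means the
    # loose parser recovered something meaningful.
    has_version = bool(result.get("version"))
    has_entries = bool(result.get("entries"))
    return has_version and has_entries
```

With the two results shown above, the Atom feed (version 'atom10', one entry) would pass, while "garbage" (empty version, no entries) would not.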

Jan 29 '22 08:01 lemon24

Some conclusions from playing with the Atom feed below:

xml.sax.SAXParseException "undefined entity" is survivable.
"mismatched tag" is not: we get all the good entries, then the broken entry in a bad state (e.g. all of its content ends up in a single element); entries after it are missing, but not always.
It may be worth finding what other kinds of errors can be encountered... (all of them).
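As a rough sketch of how these observations could be encoded, the check below allowlists SAX error messages that were seen to be survivable. The names `SURVIVABLE_MESSAGES` and `is_survivable` are hypothetical, and the allowlist contains only the one message confirmed above; everything else is assumed to be fatal until someone has looked at it.

```python
from xml.sax import SAXParseException

# Assumption: only "undefined entity" has been observed to be survivable;
# "mismatched tag" and any message not yet investigated are treated as fatal.
SURVIVABLE_MESSAGES = frozenset({"undefined entity"})


def is_survivable(exc):
    """Return True if a bozo_exception is on the survivable allowlist.

    Hypothetical helper, not an existing reader/feedparser API.
    """
    if not isinstance(exc, SAXParseException):
        return False
    # getMessage() returns the bare message, without the line/column
    # prefix that str(exc) adds.
    return exc.getMessage() in SURVIVABLE_MESSAGES
```

It would be called with `result.get('bozo_exception')`; an allowlist (rather than a denylist) seems the safer default while the full set of possible errors is unknown.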

Also, when the loose parser is used, the feed should be considered stale; that is, we should always prefer entries from the non-broken feed.

I'm thinking of something like this:

existing  parsed  desired behavior   matches current behavior?
none      any     use new (any)      yes
any       strict  use new (strict)   yes (hash takes care of it)
strict    loose   keep old (strict)  no (different hash => update)
loose     loose   use new (loose)    yes (hash takes care of it)

This would favor feeds that are temporarily broken, and eventually get fixed. For feeds that become permanently broken, it results in old strict entries not receiving updates.
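The table could be sketched as a small decision function. This is only an illustration: `should_use_new` is a hypothetical name, and it assumes reader would record which parser ("strict" or "loose") produced the stored data, which the issue does not say it does today. It only decides whether the new data is eligible; the existing hash comparison would still decide whether an actual update happens.

```python
def should_use_new(existing, parsed):
    """Decide whether newly parsed data may replace the stored data.

    Hypothetical helper. `existing` is None (nothing stored), "strict",
    or "loose"; `parsed` is "strict" or "loose".
    """
    if parsed not in ("strict", "loose"):
        raise ValueError(f"unknown parser kind: {parsed!r}")
    if existing is None:
        # Row 1: nothing stored yet, anything is better than nothing.
        return True
    if parsed == "strict":
        # Row 2: strict results are always preferred.
        return True
    # parsed == "loose": rows 3 and 4. Replace loose data, but never
    # overwrite data that came from a non-broken feed.
    return existing == "loose"
```

Only the (strict, loose) case returns False, which is the one row where the desired behavior differs from what happens now.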

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

    <entry>
        <id>one</id>
        <title>1</title>
        <summary>i</summary>
    </entry>
    <entry>
        <id>two</id>
        <title>Atom-Powered Robots Run Amok
        <summary>Summary.&veryundefinedentity;
        <content>Content.</content>
    </entry>
    <entry>
        <id>three</id>
        <title>3</title>
        <summary>iii</summary>
    </entry>

</feed>

Feb 06 '22 09:02 lemon24
