reader treats all bozo feeds as errors

Open lemon24 opened this issue 3 years ago • 1 comments

reader treats all bozo feeds as errors, even if the loose parser managed to parse them:

<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>title</title>
  <updated>2021-12-18T11:00:00</updated>
  <id>http://example.com/</id>
  <entry>
    <id>http://example.com/entry</id>
    <updated>2021-07-29T00:00:00</updated>
    <content type="html">
        &#39; &amp; &gt; &ldquo; &lt; &quot; &rdquo; &rsquo;
    </content>
  </entry>
</feed>

{
    'bozo': 1,
    'bozo_exception': SAXParseException('undefined entity'),
    'encoding': 'utf-8',
    'entries': [
        {
            'content': [
                {
                    'base': '',
                    'language': None,
                    'type': 'text/html',
                    'value': '\' & > “ < " ” ’',
                }
            ],
            'id': 'http://example.com/entry',
            'summary': '\' & > “ < " ” ’',
            ...
        }
    ],
    'feed': {
        'id': 'http://example.com/',
        'title': 'title',
        ...
    },
    'headers': {},
    'namespaces': {'': 'http://www.w3.org/2005/Atom'},
    'version': 'atom10',
}

We still need a heuristic to tell that apart from complete garbage (version, and the presence of entries?):

>>> feedparser.parse("garbage")
{'bozo': 1, 'entries': [], 'feed': {}, 'headers': {}, 'encoding': 'utf-8', 'version': '', 'bozo_exception': SAXParseException('syntax error'), 'namespaces': {}}
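One possible shape for that heuristic, as a minimal sketch: accept a bozo result only if feedparser detected a feed version and recovered at least one entry. `looks_usable` is a hypothetical name, not part of reader or feedparser, and it works on any mapping shaped like the results above.

```python
def looks_usable(result):
    """Guess whether a feedparser-style result is worth keeping.

    Hypothetical helper; `result` is any mapping shaped like the
    dicts shown above.
    """
    if not result.get('bozo'):
        # The strict parser succeeded; nothing to second-guess.
        return True
    # Bozo: trust the loose parser only if it recognized a feed format
    # and recovered at least one entry.
    return bool(result.get('version')) and bool(result.get('entries'))


# Trimmed-down versions of the two results shown above.
broken_but_parsed = {
    'bozo': 1,
    'version': 'atom10',
    'entries': [{'id': 'http://example.com/entry'}],
}
garbage = {'bozo': 1, 'version': '', 'entries': []}

print(looks_usable(broken_but_parsed))  # True
print(looks_usable(garbage))  # False
```

The trade-off is that a bozo feed that is legitimately empty would also be rejected; whether that is acceptable is an open question.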

lemon24 · Jan 29 '22 08:01

Some conclusions from playing with the Atom feed below:

  • xml.sax.SAXParseException "undefined entity" is survivable.
  • "mismatched tag" is not; we get all the good entries, and then the broken entry, in a bad state (e.g. all content in ); entries after it are missing, but not always.
  • It may be worth finding what other kinds of errors can be encountered... (all of them).
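To make the first two points concrete, here is a rough sketch that asks the standard library's strict SAX parser for its error message and checks it against a set of messages known to be survivable. It uses `xml.sax` directly, not feedparser's own parser setup, and `SURVIVABLE_MESSAGES`, `strict_error_message`, and `is_survivable` are hypothetical names. Treating every unlisted message as fatal is an assumption.

```python
import xml.sax

# Messages observed to be survivable so far; per the notes above, only
# "undefined entity" qualifies. Anything else is assumed to be fatal.
SURVIVABLE_MESSAGES = {'undefined entity'}


def strict_error_message(data):
    """Return the strict parser's error message for `data`, or None."""
    try:
        xml.sax.parseString(data, xml.sax.ContentHandler())
    except xml.sax.SAXParseException as e:
        return e.getMessage()
    return None


def is_survivable(message):
    """True if there was no error, or the error is known to be survivable."""
    return message is None or message in SURVIVABLE_MESSAGES


undefined = strict_error_message(b'<a>&nope;</a>')
mismatched = strict_error_message(b'<a><b></a>')

print(undefined, is_survivable(undefined))    # undefined entity True
print(mismatched, is_survivable(mismatched))  # mismatched tag False
```

Note that the strict parser stops at the first error, so a feed with both problems reports only whichever comes first.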

Also, when the loose parser is used, the feed should be considered stale; that is, we should always prefer entries from the non-broken feed.

I'm thinking of something like this:

| existing | parsed | desired behavior | current behavior |
| --- | --- | --- | --- |
| none | any | use new (any) | yes |
| any | strict | use new (strict) | yes (hash takes care of it) |
| strict | loose | keep old (strict) | no (different hash => update) |
| loose | loose | use new (loose) | yes (hash takes care of it) |

This would favor feeds that are temporarily broken, and eventually get fixed. For feeds that become permanently broken, it results in old strict entries not receiving updates.
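The desired behavior reduces to a single special case, sketched below. The function name and the string values are hypothetical, chosen only to mirror the cases listed above; this is not reader's actual update logic.

```python
def choose_entries(existing, parsed):
    """Decide which entries to keep.

    existing: None, 'strict', or 'loose'; how the stored entries were parsed
    parsed: 'strict' or 'loose'; how the newly retrieved feed was parsed
    """
    if existing == 'strict' and parsed == 'loose':
        # Never let a loosely parsed (possibly broken) feed overwrite
        # entries that came from a well-formed one.
        return 'keep old'
    # No stored entries, a strict parse, or loose replacing loose.
    return 'use new'


for existing, parsed in [
    (None, 'loose'),
    ('loose', 'strict'),
    ('strict', 'loose'),
    ('loose', 'loose'),
]:
    print(existing, parsed, choose_entries(existing, parsed))
```

Only the strict-then-loose case differs from "always take the new data", which is also the one case where the current hash-based update does the wrong thing.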

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

    <entry>
        <id>one</id>
        <title>1</title>
        <summary>i</summary>
    </entry>
    <entry>
        <id>two</id>
        <title>Atom-Powered Robots Run Amok
        <summary>Summary.&veryundefinedentity;
        <content>Content.</content>
    </entry>
    <entry>
        <id>three</id>
        <title>3</title>
        <summary>iii</summary>
    </entry>

</feed>

lemon24 · Feb 06 '22 09:02