feedparser
Elements get dropped from feed items with repeated nested elements
Because feedparser constructs each entry as a flat dictionary, an element name that repeats within an item can be overwritten, even when each occurrence is nested under a different parent element.
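To illustrate what I mean by clobbering, here is a minimal sketch. It is a simplified model and not feedparser's actual implementation: the `flatten` function and the `events` list are hypothetical, and only the key names are taken from the parsed output in this report.

```python
# Simplified model of building a flat entry dictionary.
# This is NOT feedparser's real code; flatten() and events are hypothetical.
def flatten(events):
    entry = {}
    for key, value in events:
        # A later element with the same key silently replaces the earlier one.
        entry[key] = value
    return entry


# (key, text) pairs in document order, as a parser might plausibly emit them.
events = [
    ("bc_name", "Location name"),  # nested under <bc:location>
    ("bc_location", ""),
    ("bc_name", "Contact name"),   # nested under <bc:contact>
    ("bc_contact", ""),
]

entry = flatten(events)
print(entry)
```

Because the parent element is not part of the key, both `<bc:name>` values compete for the same `bc_name` slot and only the last one survives.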
For example, the following feed repeats <bc:name> under both <bc:location> and <bc:contact>:
<?xml version="1.0" encoding="UTF-8" ?>
<rss version="2.0" xmlns:bc="http://example.com/rss">
  <channel>
    <title>Posts</title>
    <description>RSS feed</description>
    <link>https://example.com/feed</link>
    <language>en-us</language>
    <lastBuildDate>Fri, 13 Jul 2018 01:15:06 +00:00</lastBuildDate>
    <generator uri="https://www.example.com" version="1.0">Example</generator>
    <item>
      <title>Title</title>
      <description><![CDATA[Some description]]></description>
      <link>https://example.com/posts</link>
      <guid>https://example.com/posts/123</guid>
      <bc:location>
        <bc:name>Location name</bc:name>
      </bc:location>
      <bc:contact>
        <bc:name>Contact name</bc:name>
      </bc:contact>
    </item>
  </channel>
</rss>
Parsing it with feedparser.parse() results in the following (notice that the <bc:name> element with "Location name" is missing from entries):
{'bozo': 0,
 'encoding': 'utf-8',
 'entries': [{'bc_contact': '',
              'bc_location': '',
              'bc_name': 'Contact name',
              'guidislink': False,
              'id': 'https://example.com/posts/123',
              'link': 'https://example.com/posts',
              'links': [{'href': 'https://example.com/posts',
                         'rel': 'alternate',
                         'type': 'text/html'}],
              'summary': 'Some description',
              'summary_detail': {'base': '',
                                 'language': None,
                                 'type': 'text/html',
                                 'value': 'Some description'},
              'title': 'Title',
              'title_detail': {'base': '',
                               'language': None,
                               'type': 'text/plain',
                               'value': 'Title'}}],
 'feed': {'generator': 'Example',
          'generator_detail': {'href': 'https://www.example.com',
                               'name': 'Example',
                               'version': '1.0'},
          'language': 'en-us',
          'link': 'https://example.com/feed',
          'links': [{'href': 'https://example.com/feed',
                     'rel': 'alternate',
                     'type': 'text/html'}],
          'subtitle': 'RSS feed',
          'subtitle_detail': {'base': '',
                              'language': None,
                              'type': 'text/html',
                              'value': 'RSS feed'},
          'title': 'Posts',
          'title_detail': {'base': '',
                           'language': None,
                           'type': 'text/plain',
                           'value': 'Posts'},
          'updated': 'Fri, 13 Jul 2018 01:15:06 +00:00',
          'updated_parsed': None},
 'namespaces': {'bc': 'http://example.com/rss'},
 'version': 'rss20'}
This does not seem correct to me. I would expect the parsed representation of the feed not to drop elements, but maybe there is some part of the RSS spec that says elements cannot be repeated?
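For comparison, a parser that preserves the tree keeps both values. The sketch below uses only the standard library's `xml.etree.ElementTree` on a trimmed copy of the feed above (the XML declaration and unrelated channel fields are omitted). The names `FEED`, `NS`, and `nested_names` are mine, chosen for illustration; this is a possible workaround, not a feedparser feature.

```python
import xml.etree.ElementTree as ET

# Namespace prefix mapping, taken from the xmlns:bc declaration in the feed.
NS = {"bc": "http://example.com/rss"}

# Trimmed copy of the example feed; only the item with the repeated element.
FEED = """<rss version="2.0" xmlns:bc="http://example.com/rss">
  <channel>
    <title>Posts</title>
    <item>
      <title>Title</title>
      <bc:location>
        <bc:name>Location name</bc:name>
      </bc:location>
      <bc:contact>
        <bc:name>Contact name</bc:name>
      </bc:contact>
    </item>
  </channel>
</rss>"""


def nested_names(xml_text):
    """Return the nested <bc:name> values for each <item>, keyed by parent."""
    root = ET.fromstring(xml_text)
    results = []
    for item in root.iter("item"):
        results.append({
            # The path keeps the parent element, so the two names stay distinct.
            "location_name": item.findtext("bc:location/bc:name", namespaces=NS),
            "contact_name": item.findtext("bc:contact/bc:name", namespaces=NS),
        })
    return results


names = nested_names(FEED)
print(names)
```

Both "Location name" and "Contact name" are present in the source document, so the loss happens when the entry is flattened, not in the XML itself.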