feedparser
Store custom namespace elements inside author/contributor in their dict
Related to #24 and #145, but only affects elements inside <author> or <contributor>. This specifically solves the incorrect parsing of arXiv Atom feeds mentioned in #145, where author affiliations were lost.
Problem
Custom namespace elements (e.g. <arxiv:affiliation>) inside <author> or <contributor> are stored at entry level, causing:
- Loss of association between authors and their custom data
- Overwriting when multiple authors have the same custom element
Scope
This PR does not solve the general problem of how to handle unknown elements (whether to store on the parent, aggregate into a list, or assign as dict fields). However:
- It improves the current (incorrect) behavior where child elements overwrite parent-level fields
- It does not affect any other behavior — all existing tests pass
Example
<entry>
  <author>
    <name>Alice</name>
    <arxiv:affiliation>MIT</arxiv:affiliation>
  </author>
  <author>
    <name>Bob</name>
    <arxiv:affiliation>Stanford</arxiv:affiliation>
  </author>
</entry>
Before:
entry.authors[0] # {'name': 'Alice'}
entry.authors[1] # {'name': 'Bob'}
entry.arxiv_affiliation # 'Stanford' (overwrites MIT, stored in entry)
After:
entry.authors[0] # {'name': 'Alice', 'arxiv_affiliation': 'MIT'}
entry.authors[1] # {'name': 'Bob', 'arxiv_affiliation': 'Stanford'}
'arxiv_affiliation' in entry # False
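As a rough illustration of what the new layout allows, the sketch below pairs each author with their affiliation. It assumes the "After" behavior shown above; `author_affiliations` is a hypothetical helper, and the `entry` literal is a plain-dict stand-in for a parsed entry, not output captured from feedparser.

```python
def author_affiliations(entry):
    """Return (name, affiliation) pairs; affiliation is None when absent."""
    return [
        (author.get("name"), author.get("arxiv_affiliation"))
        for author in entry.get("authors", [])
    ]


# Stand-in shaped like the "After" result above (assumed, not real parser output).
entry = {
    "authors": [
        {"name": "Alice", "arxiv_affiliation": "MIT"},
        {"name": "Bob", "arxiv_affiliation": "Stanford"},
    ]
}

pairs = author_affiliations(entry)
```

With the "Before" layout this pairing is not recoverable, because only the last affiliation survives at entry level.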
Alternatives considered
Adding explicit support for the arXiv namespace (like itunes, dc, media) was considered. However, it would increase the maintenance burden and would not help other custom namespaces.
Implementation
The fix is small (~10 lines):
- Add _maybe_get_author_context() to return the current author/contributor dict if inside one
- Update pop() to store unknown elements in the author context when applicable
- Ensure _end_author() / _end_contributor() exit the author/contributor context before pop()
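The idea behind these steps can be sketched with a toy SAX handler. This is not feedparser's code: `ToyHandler`, `parse_entry`, and `SAMPLE` are hypothetical names, the handler uses the standard library's `xml.sax` with namespace processing left off, and it only mimics the "store unknown elements in the enclosing author dict" rule.

```python
import xml.sax

SAMPLE = """<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:arxiv="http://arxiv.org/schemas/atom">
  <author><name>Alice</name><arxiv:affiliation>MIT</arxiv:affiliation></author>
  <author><name>Bob</name><arxiv:affiliation>Stanford</arxiv:affiliation></author>
</entry>"""


class ToyHandler(xml.sax.ContentHandler):
    """Toy model of the author-context rule (hypothetical, not feedparser)."""

    def __init__(self):
        super().__init__()
        self.entry = {"authors": [], "contributors": []}
        self._person = None  # dict of the <author>/<contributor> being parsed
        self._text = []

    def startElement(self, name, attrs):
        self._text = []
        if name in ("author", "contributor"):
            # Enter the author context: child elements go into this dict.
            self._person = {}
            self.entry[name + "s"].append(self._person)

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        if name in ("author", "contributor"):
            # Leave the context before anything else is stored.
            self._person = None
            return
        if name == "entry":
            return
        key = name.replace(":", "_")  # arxiv:affiliation -> arxiv_affiliation
        target = self._person if self._person is not None else self.entry
        target[key] = "".join(self._text).strip()


def parse_entry(xml_text):
    handler = ToyHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.entry


entry = parse_entry(SAMPLE)
```

Because the element is written to whichever dict is currently open, two authors with the same custom element no longer overwrite each other, and nothing leaks to entry level.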
Tests
- Added four wellformed Atom 1.0 feeds under tests/wellformed/atom10/ to cover custom elements inside <author> and <contributor>.
- The full test suite still passes.