feedparser

Store custom namespace elements inside author/contributor in their dict

Open romazu opened this issue 1 month ago • 0 comments

Related to #24 and #145, but only affects elements inside <author> or <contributor>. Specifically, it fixes the incorrect parsing of arXiv Atom feeds reported in #145, where author affiliations were lost.

Problem

Custom namespace elements (e.g. <arxiv:affiliation>) inside <author> or <contributor> are stored at the entry level, which causes:

  1. Loss of association between authors and their custom data
  2. Overwriting when multiple authors have the same custom element

Scope

This PR does not solve the general problem of how to handle unknown elements (whether to store on the parent, aggregate into a list, or assign as dict fields). However:

  1. It improves the current (incorrect) behavior where child elements overwrite parent-level fields
  2. It does not affect any other behavior — all existing tests pass

Example

<entry>
  <author>
    <name>Alice</name>
    <arxiv:affiliation>MIT</arxiv:affiliation>
  </author>
  <author>
    <name>Bob</name>
    <arxiv:affiliation>Stanford</arxiv:affiliation>
  </author>
</entry>

Before:

entry.authors[0]  # {'name': 'Alice'}
entry.authors[1]  # {'name': 'Bob'}
entry.arxiv_affiliation  # 'Stanford' (overwrites MIT, stored in entry)

After:

entry.authors[0]  # {'name': 'Alice', 'arxiv_affiliation': 'MIT'}
entry.authors[1]  # {'name': 'Bob', 'arxiv_affiliation': 'Stanford'}
'arxiv_affiliation' in entry  # False
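
For an independent cross-check of the expected "After" shape (for example when writing assertions against a real feed), the same per-author dicts can be built from the raw XML with the standard library alone. This is only a sketch and does not use feedparser: the helper name `authors_with_custom_fields` is hypothetical, the arXiv namespace URI is assumed to be `http://arxiv.org/schemas/atom`, and the `prefix_localname` key format simply copies the keys shown in the example above.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ARXIV = "http://arxiv.org/schemas/atom"  # assumption: the arXiv namespace URI

# The example entry from above, wrapped in a feed with namespace declarations.
FEED = f"""<feed xmlns="{ATOM}" xmlns:arxiv="{ARXIV}">
  <entry>
    <author>
      <name>Alice</name>
      <arxiv:affiliation>MIT</arxiv:affiliation>
    </author>
    <author>
      <name>Bob</name>
      <arxiv:affiliation>Stanford</arxiv:affiliation>
    </author>
  </entry>
</feed>"""


def authors_with_custom_fields(entry, prefixes):
    """Hypothetical helper: one dict per <author>, custom children kept with it."""
    authors = []
    for author in entry.findall(f"{{{ATOM}}}author"):
        fields = {}
        for child in author:
            # ElementTree spells namespaced tags as "{uri}localname".
            if child.tag.startswith("{"):
                uri, _, local = child.tag[1:].partition("}")
            else:
                uri, local = "", child.tag
            # Atom children keep their bare name; others become "prefix_localname".
            key = local if uri == ATOM else f"{prefixes.get(uri, 'unknown')}_{local}"
            fields[key] = (child.text or "").strip()
        authors.append(fields)
    return authors


root = ET.fromstring(FEED)
entry = root.find(f"{{{ATOM}}}entry")
authors = authors_with_custom_fields(entry, {ARXIV: "arxiv"})
print(authors)
```

Because each affiliation is written into the dict of the author it belongs to, the two values cannot overwrite each other.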

Alternatives considered

Adding explicit support for the arXiv namespace (as is done for itunes, dc, and media) was considered. However, it would increase the maintenance burden and would not help other custom namespaces.

Implementation

The fix is small (~10 lines):

  • Add _maybe_get_author_context(), which returns the current author/contributor dict when the parser is inside one
  • Update pop() to store unknown elements in the author context when applicable
  • Ensure _end_author() / _end_contributor() exit the author/contributor context before pop()
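
The routing idea behind these three steps can be sketched with a toy SAX handler. This is not feedparser's code: `ToyHandler`, its attributes, and the sample feed (including the `Carol`/`CERN` contributor and the `arxiv:comment` value) are made up for illustration, the arXiv namespace URI is an assumption, and only the names `_maybe_get_author_context()` and `pop()` are borrowed from the list above.

```python
import io
import xml.sax
from xml.sax.handler import feature_namespaces

ATOM = "http://www.w3.org/2005/Atom"
ARXIV = "http://arxiv.org/schemas/atom"  # assumption: the arXiv namespace URI
CONTAINERS = {"feed", "entry", "author", "contributor"}

# Illustrative feed; the contributor and comment values are invented.
FEED = f"""<feed xmlns="{ATOM}" xmlns:arxiv="{ARXIV}">
  <entry>
    <author><name>Alice</name><arxiv:affiliation>MIT</arxiv:affiliation></author>
    <author><name>Bob</name><arxiv:affiliation>Stanford</arxiv:affiliation></author>
    <contributor><name>Carol</name><arxiv:affiliation>CERN</arxiv:affiliation></contributor>
    <arxiv:comment>10 pages</arxiv:comment>
  </entry>
</feed>"""


class ToyHandler(xml.sax.ContentHandler):
    """Toy parser showing the routing idea only, not feedparser's internals."""

    def __init__(self, prefixes):
        super().__init__()
        self.prefixes = prefixes    # namespace URI -> key prefix
        self.entry = {"authors": [], "contributors": []}
        self.author_context = None  # dict being filled inside <author>/<contributor>
        self.text = []

    def _maybe_get_author_context(self):
        # Current author/contributor dict, or None when outside of one.
        return self.author_context

    def startElementNS(self, name, qname, attrs):
        uri, local = name
        self.text = []
        if uri == ATOM and local in ("author", "contributor"):
            self.author_context = {}
            self.entry[local + "s"].append(self.author_context)

    def characters(self, content):
        self.text.append(content)

    def endElementNS(self, name, qname):
        uri, local = name
        if uri == ATOM and local in CONTAINERS:
            if local in ("author", "contributor"):
                # Leave the context so later elements go back to the entry.
                self.author_context = None
            return
        self.pop(uri, local)

    def pop(self, uri, local):
        key = local if uri == ATOM else f"{self.prefixes.get(uri, 'unknown')}_{local}"
        target = self._maybe_get_author_context()
        if target is None:
            target = self.entry     # outside an author: entry-level storage
        target[key] = "".join(self.text).strip()


handler = ToyHandler({ARXIV: "arxiv"})
parser = xml.sax.make_parser()
parser.setFeature(feature_namespaces, True)
parser.setContentHandler(handler)
parser.parse(io.BytesIO(FEED.encode("utf-8")))
entry = handler.entry
print(entry)
```

In this sketch, custom elements outside an author (such as `arxiv:comment`) still land on the entry, which matches the stated scope: only elements inside <author> or <contributor> are rerouted.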

Tests

  • Added four wellformed Atom 1.0 feeds under tests/wellformed/atom10/ to cover custom elements inside <author> and <contributor>.
  • The full test suite still passes.

romazu, Dec 02 '25 21:12