
Extend `feedparser.parse()` with `archive_url_data:bool` and `request_hooks`

Open jvanasco opened this issue 2 months ago • 0 comments

We needed a way to archive the data that feedparser uses when processing a URL, for the purposes of troubleshooting, running tests, and regression analysis.

There were two options to achieve that:

1. Download the URL ourselves, then parse that with feedparser.
2. Extend feedparser to save the "raw" data.

This PR is a quick attempt at the latter, since the ability to capture this data is widely useful for troubleshooting:

  1. Introduces `archive_url_data: bool` to `feedparser.parse`. If set, a `.raw` attribute on the resulting `FeedParserDict` will contain the "content" and "headers". Headers are copied to this BEFORE they are updated by kwargs.

  2. Extends `feedparser.api._open_resource` to return the "type" of data accessed, in addition to the data itself.

  3. Adds `request_hooks` to `parse`. This is a dict containing "hooks" to pass on to `requests.get` for customization. It also supports a `response.postprocess` hook, which is not passed on to requests and can be used to operate on the response before it is discarded. This allows capturing the actual IP address of the remote server, as shown below. (The `response_peername__hook` needs to execute before content is read from the connection.)
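To make the proposed `request_hooks` semantics concrete, here is a minimal sketch of how the hook dict might be split between events that `requests` understands and local-only events like `response.postprocess`. All names here are assumptions about this PR's design, not feedparser's actual implementation:

```python
from typing import Any, Callable, Dict, Tuple

# requests itself only dispatches the "response" hook event; anything
# else (e.g. "response.postprocess") would be handled by feedparser.
REQUESTS_HOOK_EVENTS = {"response"}

HookMap = Dict[str, Callable[..., Any]]

def split_hooks(request_hooks: HookMap) -> Tuple[HookMap, HookMap]:
    """Separate hooks destined for requests.get() from local-only hooks."""
    forwarded = {
        event: fn
        for event, fn in request_hooks.items()
        if event in REQUESTS_HOOK_EVENTS
    }
    local = {
        event: fn
        for event, fn in request_hooks.items()
        if event not in REQUESTS_HOOK_EVENTS
    }
    return forwarded, local
```

Under this split, `forwarded` would go to `requests.get(url, hooks=forwarded)`, while `local["response.postprocess"]` would be invoked by feedparser itself after parsing but before the `Response` object is lost.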

I'm happy to achieve this in other ways and work towards an acceptable PR; I'd just like to ensure there is a way to access and operate on the raw data feedparser natively pulls. We've had issues due to networking, round-robin DNS, and throttling that are best identified, and sometimes only solvable, by examining this info.

```python
import typing

import feedparser
from feedparser.http import RequestHooks

from metadata_parser.requests_extensions import response_peername__hook

if typing.TYPE_CHECKING:
    from requests import Response

    from feedparser.util import FeedParserDict


def process_result(response: "Response", result: "FeedParserDict") -> None:
    # `_mp_peername` is stashed on the response by response_peername__hook
    result.raw["peername"] = response._mp_peername


request_hooks: RequestHooks = {
    "response": response_peername__hook,
    "response.postprocess": process_result,
}

feed = feedparser.parse(
    "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml",
    archive_url_data=True,
    request_hooks=request_hooks,
)

print("Feed was downloaded from:", feed.raw["peername"])
```
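For readers unfamiliar with `response_peername__hook`, here is a hypothetical sketch of what such a peername-capturing "response" hook could look like. (The real helper lives in `metadata_parser`; this reimplementation is an assumption, not its actual source.) It must run as a requests hook, before the body is consumed, because the underlying socket may be released afterwards:

```python
def peername_hook(response, **kwargs):
    """Stash the remote (ip, port) on the response before content is read.

    Reaches into urllib3 internals (``response.raw._connection.sock``),
    which are private and may change between urllib3 versions.
    """
    try:
        sock = response.raw._connection.sock
        response._mp_peername = sock.getpeername()
    except AttributeError:
        # Connection already released, or a non-socket transport
        response._mp_peername = None
    return response
```

The `try`/`except` matters: once the connection is returned to the pool, `_connection` (or its `sock`) is gone, which is exactly why this hook must fire before the content is read.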

Fixes: #289

jvanasco, Nov 06 '25 23:11