Extend `feedparser.parse()` with `archive_url_data:bool` and `request_hooks`
We needed a way to archive the data that feedparser uses when processing a URL, for the purposes of troubleshooting, running tests, and regression analysis.
There were two options to achieve that:
1. Download the URL ourselves, then parse that with feedparser.
2. Extend feedparser to save the "raw" data.

This PR is a quick attempt at the latter, since having this available for troubleshooting is widely useful:
- introduces `archive_url_data:bool` to `feedparser.parse`. If set, a `.raw` attribute on the result `FeedParserDict` will contain the "content" and "headers". Headers are copied to this BEFORE they are updated by kwargs.
- extends `feedparser.api._open_resource` to return the "type" of data accessed, in addition to the data.
- Additionally, `request_hooks` are added to `parse`. This is a dict containing "hooks" to pass on to `requests.get` for customization. It also supports a "response.postprocess" hook, which is not passed on to requests and can be used to operate on the response before it is lost. This allows for capturing the actual IP address of the remote server, as shown below. (The `response_peername__hook` needs to execute before content is read from the connection.)
I'm happy to achieve this in other ways and work towards an acceptable PR. I'd just like to ensure there is a way to access and operate on the raw data that feedparser fetches natively. We've had issues caused by networking, round-robin DNS, and throttling that are best identified, and can only be solved, by examining this info.
```python
import typing

import feedparser
from feedparser.http import RequestHooks
from metadata_parser.requests_extensions import response_peername__hook

if typing.TYPE_CHECKING:
    from requests import Response
    from feedparser.util import FeedParserDict


def process_result(response: "Response", result: "FeedParserDict") -> None:
    # Copy the peername recorded on the response into the archived raw data.
    result.raw["peername"] = response._mp_peername


request_hooks: RequestHooks = {
    # Passed on to requests; must run before content is read from the connection.
    "response": response_peername__hook,
    # Not passed on to requests; operates on the response before it is lost.
    "response.postprocess": process_result,
}

feed = feedparser.parse(
    "https://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml",
    archive_url_data=True,
    request_hooks=request_hooks,
)
print("Feed was downloaded from:", feed.raw["peername"])
```
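For reviewers, here is a rough, standalone sketch of the two behaviours described in the list above: separating the "response.postprocess" hook from the hooks that go to requests, and copying the headers so that later updates do not change the archive. The names `split_request_hooks`, `snapshot_raw`, and `POSTPROCESS_KEY` are hypothetical and chosen for illustration only; this is not necessarily how the PR implements it.

```python
from typing import Any, Callable, Dict, Mapping, Optional, Tuple

# Hypothetical names throughout; an illustration, not feedparser's actual code.
POSTPROCESS_KEY = "response.postprocess"

Hook = Callable[..., Any]


def split_request_hooks(
    request_hooks: Optional[Mapping[str, Hook]],
) -> Tuple[Dict[str, Hook], Optional[Hook]]:
    """Separate the hooks meant for requests from the local postprocess hook."""
    hooks = dict(request_hooks or {})  # copy, so the caller's dict is untouched
    postprocess = hooks.pop(POSTPROCESS_KEY, None)
    return hooks, postprocess


def snapshot_raw(content: bytes, headers: Mapping[str, str]) -> Dict[str, Any]:
    """Copy the response data so later header updates do not alter the archive."""
    return {"content": content, "headers": dict(headers)}
```

Under these assumptions, the first function's dict would be handed to `requests.get(..., hooks=...)`, while the second return value would be called with the response and the result once parsing has finished.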
Fixes: #289