reader
Handle redirects and gone feeds gracefully
https://feedparser.readthedocs.io/en/latest/http-redirect.html
If you are polling a feed on a regular basis, it is very important to check the status code (d.status) every time you download. If the feed has been permanently redirected, you should update your database or configuration file with the new address (d.href). Repeatedly requesting the original address of a feed that has been permanently redirected is very rude, and may get you banned from the server.
Repeatedly requesting a feed that has been marked as “gone” is very rude, and may get you banned from the server.
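For illustration, the advice quoted above boils down to a small decision per poll. This is a minimal sketch, not reader or feedparser code: `Action`, `Decision`, and `decide` are hypothetical names, and treating 308 as permanent alongside 301 is an assumption based on general HTTP semantics. With feedparser, `status` and `href` would come from `d.get("status")` and `d.get("href")` on the result of `feedparser.parse(url)`.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Action(Enum):
    KEEP = "keep"                  # keep polling the stored URL
    UPDATE_URL = "update_url"      # store the new URL, poll that from now on
    STOP_POLLING = "stop_polling"  # the feed is gone, stop requesting it


class Decision(NamedTuple):
    action: Action
    new_url: Optional[str] = None


def decide(status: Optional[int], href: Optional[str], url: str) -> Decision:
    """Decide what to do with a stored feed URL after one poll.

    status may be None, e.g. when the feed was not fetched over HTTP.
    """
    if status == 410:
        return Decision(Action.STOP_POLLING)
    # Assumption: 301 and 308 are the permanent redirects worth acting on.
    if status in (301, 308) and href and href != url:
        return Decision(Action.UPDATE_URL, href)
    # Temporary redirects (302, 307) and everything else keep the old URL.
    return Decision(Action.KEEP)
```

Keeping this as a pure function makes it easy to test without a network; the caller is the one that updates the database or disables the feed.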
Related comment:
https://github.com/lemon24/reader/blob/836ff81cf68343b415fb4956d8c69266120f3269/src/reader/_update.py#L460-L461
Misc thoughts:
- Ideally, this should be a plugin.
- To allow plugins to handle this, we likely need to expose additional info to after_feed_update_hooks – old feed, new feed, status code or its meaning.
- For redirects, we need the status code of the initial request – an update can have a redirect and succeed.
- The plugin that changes the URL must run after all other ones that use the old one (e.g. after_entry_update_hooks).
- Should the UpdateResult/UpdatedFeed returned by update_feeds_iter()/update_feed() have the new or the old URL?
  - Likely the new one.
- Assuming an after_feed_update_hooks plugin that runs after the one that changes the URL:
  - Should it get the old feed, the new feed, or both?
  - If we end up wrapping exceptions from updating one feed for #218, and there's an error in the plugin after the URL was changed, what should the `url` of the wrapper exception be?
Just a thought.
Consider API semantics that allow a plugin to only mark a feed URL for a change. Then, after all of the plugins have run, you check whether any plugin requested a change (and maybe make sure only one did), and make the change itself as part of the processing mechanism that runs outside of the plugins.
This is a subtle change, but that way you can (probably) drop the requirement that such a plugin must run last. It also seems to simplify the issues you mention in the last point, and lets you check whether such a request makes sense in the context of other plugins or other external factors.
EDIT: typo
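A rough sketch of the mark-then-apply idea described in this comment, independent of reader's actual plugin API. Everything here is hypothetical (`UrlChangeRequests`, `request_change`, `resolve`, `run_hooks_then_apply`, and the hook signature); it only shows that hooks record a request, and that the change is made once, outside the hooks, after a conflict check.

```python
from typing import Callable, Dict, Iterable, Optional, Set


class ConflictingUrlChanges(Exception):
    """More than one distinct new URL was requested for the same feed."""


class UrlChangeRequests:
    """Collects URL change requests; hooks never change the URL themselves."""

    def __init__(self) -> None:
        self._requested: Dict[str, Set[str]] = {}

    def request_change(self, old_url: str, new_url: str) -> None:
        self._requested.setdefault(old_url, set()).add(new_url)

    def resolve(self, old_url: str) -> Optional[str]:
        """Return the single requested URL, or None if nothing was requested."""
        candidates = self._requested.pop(old_url, set())
        if not candidates:
            return None
        if len(candidates) > 1:
            raise ConflictingUrlChanges(f"{old_url}: {sorted(candidates)}")
        return next(iter(candidates))


Hook = Callable[[str, UrlChangeRequests], None]


def run_hooks_then_apply(
    old_url: str,
    hooks: Iterable[Hook],
    change_url: Callable[[str, str], None],
) -> str:
    """Run every hook against the old URL, then apply at most one change."""
    requests = UrlChangeRequests()
    for hook in hooks:
        # Hook order no longer matters: all of them see the old URL.
        hook(old_url, requests)
    new_url = requests.resolve(old_url)
    if new_url is not None and new_url != old_url:
        change_url(old_url, new_url)
        return new_url
    return old_url
```

Because the change happens after the loop, a hook that fails still fails against the old URL, which sidesteps the question of which URL a wrapper exception should carry.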
@zifot, that's actually a great idea, thank you!
I think it's doable right now with tags:
    def after_feed_update(reader, feed, ...):
        # runs for each feed
        new_url = is_permanent_redirect(feed, ...)
        if new_url:
            reader.set_tag(feed, '.url-change-needed', new_url)

    def after_feeds_update(reader):
        # runs after all the feeds
        for feed in reader.get_feeds(tags=['.url-change-needed']):
            new_url = reader.get_tag(feed, '.url-change-needed')
            # for later: how do we deal with InvalidFeedURLError?
            reader.change_feed_url(feed, new_url)
            reader.delete_tag(new_url, '.url-change-needed')
Note to self: This seems like a very useful pattern, mention it in the docs for plugin authors (when we have them). The way we're handling .reader.dedupe.once for entry_dedupe is vaguely similar (mark, then change).