Handle redirects and gone feeds gracefully

Open lemon24 opened this issue 2 years ago • 3 comments

https://feedparser.readthedocs.io/en/latest/http-redirect.html

If you are polling a feed on a regular basis, it is very important to check the status code (d.status) every time you download. If the feed has been permanently redirected, you should update your database or configuration file with the new address (d.href). Repeatedly requesting the original address of a feed that has been permanently redirected is very rude, and may get you banned from the server.

Repeatedly requesting a feed that has been marked as “gone” is very rude, and may get you banned from the server.
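
A minimal sketch of the check the feedparser docs describe, assuming the caller passes in the parsed result's status and address (d.status and d.href above). The function name `next_action` and the action names are hypothetical, not part of feedparser or reader. Treating 308 as permanent alongside 301 is also an assumption.

```python
# Assumption: these are the status codes worth acting on.
PERMANENT_REDIRECTS = {301, 308}
GONE = 410


def next_action(status, old_url, href):
    """Decide what to do with a stored feed URL after one poll.

    status and href stand in for feedparser's d.status and d.href.
    Returns an (action, url) tuple; the action names are made up for this sketch.
    """
    if status == GONE:
        # The feed is gone for good: stop polling it.
        return ("stop-polling", old_url)
    if status in PERMANENT_REDIRECTS and href and href != old_url:
        # Permanently moved: store the new address for future polls.
        return ("update-url", href)
    # Anything else (200, 304, temporary redirects, errors): keep the URL.
    return ("keep", old_url)
```

With feedparser, this could be called as `next_action(d.get('status'), url, d.get('href'))` after `d = feedparser.parse(url)`.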

lemon24 Jul 16 '21 17:07

Related comment:

https://github.com/lemon24/reader/blob/836ff81cf68343b415fb4956d8c69266120f3269/src/reader/_update.py#L460-L461

Misc thoughts:

  • Ideally, this should be a plugin.
  • To allow plugins to handle this, we likely need to expose additional info to after_feed_update_hooks – old feed, new feed, status code or its meaning.
  • For redirects, we need the status code of the initial request – an update can have a redirect and succeed.
  • The plugin that changes the URL must run after all other ones that use the old one (e.g. after_entry_update_hooks).
  • Should the UpdateResult/UpdatedFeed returned by update_feeds_iter()/update_feed() have the new or the old URL?
    • Likely the new one.
  • Assuming an after_feed_update_hooks plugin that runs after the one that changes the URL:
    • Should it get the old feed, the new feed, or both?
    • If we end up wrapping exceptions from updating one feed for #218, and there's an error in the plugin after the URL was changed, what should the url of the wrapper exception be?
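
To illustrate why the final status code alone is not enough, here is a rough sketch that looks at the whole redirect chain. Everything in it is an assumption rather than reader's API: the function name, the inputs, and the policy of accepting a URL change only if every hop was permanent and the final response was usable. With the requests library, the hop list could come from `[r.status_code for r in response.history]`.

```python
PERMANENT = {301, 308}  # assumption: the permanent redirect status codes
USABLE = {200, 304}     # assumption: final statuses that mean the new URL works


def permanent_redirect_target(hops, final_status, final_url):
    """Return the URL to store if the whole chain was a permanent redirect.

    hops: status codes of the intermediate redirect responses, in order.
    Returns None when the stored URL should stay as it is.
    """
    if not hops:
        return None  # no redirect happened
    if final_status not in USABLE:
        return None  # do not move the feed to a URL that did not work
    if all(status in PERMANENT for status in hops):
        return final_url
    return None  # at least one hop was temporary
```

An update that ends in 200 after a 301 hop yields the new URL, while the same 200 reached through a 302 hop does not. A hook would need the initial status codes to tell the two apart.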

lemon24 Jun 17 '23 08:06

Just a thought.

Consider API semantics that allow a plugin to only mark a feed URL for change. Then, after all of the plugins have run, you check whether any plugin requested a change (and maybe make sure only one did), and make the change itself as part of the processing mechanism that runs outside of the plugins.

This is a subtle change, but that way you can (probably) drop the requirement that such a plugin must run last. It also seems to simplify the issues you mention in the last point, and it lets you decide whether such a request makes sense in the context of other plugins or other external factors.
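
A minimal sketch of the idea, not tied to reader's actual plugin API: plugins only record a request, and the core resolves it once after all of them have run. The names `UrlChangeRequests`, `request`, `resolve` and `run_plugins` are hypothetical.

```python
class UrlChangeRequests:
    """Collects URL change requests made by plugins during one feed update."""

    def __init__(self):
        self._requested = {}  # plugin name -> requested new URL

    def request(self, plugin_name, new_url):
        # Plugins only record intent; they never change the URL themselves.
        self._requested[plugin_name] = new_url

    def resolve(self):
        """Return the agreed new URL, None if nobody asked, or raise on conflict."""
        urls = set(self._requested.values())
        if not urls:
            return None
        if len(urls) > 1:
            raise ValueError(f"conflicting URL change requests: {self._requested}")
        return urls.pop()


def run_plugins(plugins, feed_url):
    """Run all plugins against the old URL, then decide on the change once."""
    changes = UrlChangeRequests()
    for plugin in plugins:
        plugin(feed_url, changes)  # every plugin still sees the old URL
    return changes.resolve()
```

Because the change is applied after the loop, plugin order no longer matters, and the conflict check lives in one place.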

EDIT: typo

zifot Jun 18 '23 08:06

@zifot, that's actually a great idea, thank you!

I think it's doable right now with tags:

def after_feed_update(reader, feed, ...):
    # runs for each feed
    new_url = is_permanent_redirect(feed, ...)
    if new_url:
        reader.set_tag(feed, '.url-change-needed', new_url)

def after_feeds_update(reader):
    # runs after all the feeds
    for feed in reader.get_feeds(tags=['.url-change-needed']):
        new_url = reader.get_tag(feed, '.url-change-needed')
        # for later: how do we deal with InvalidFeedURLError?
        reader.change_feed_url(feed, new_url)
        # tags should move with the feed, so the marker is now under new_url
        reader.delete_tag(new_url, '.url-change-needed')

Note to self: This seems like a very useful pattern; mention it in the docs for plugin authors (when we have them). The way we're handling .reader.dedupe.once for entry_dedupe is vaguely similar (mark, then change).

lemon24 Jun 18 '23 10:06