reader icon indicating copy to clipboard operation
reader copied to clipboard

Different feed update frequencies

Open lemon24 opened this issue 1 year ago • 2 comments

#307 needs a way of changing the feed update frequency. Just like https://github.com/lemon24/reader/issues/96#issuecomment-1236304134, this should be a configurable global/per-feed strategy.

Use cases:

  • different recurring frequencies (1h, 6h, 1d); a naive #307 implementation could bump the feed up to the next less frequent value
  • no earlier than X; can be used by #307 to react to Retry-After
  • only new feeds; special case that already exists for update_feeds(new=True)
  • never/infinity; special case that already exists for update_feeds(updates_enabled=True)
    • although, updates_enabled should be independent (e.g. if I temporarily disable updates, the configured frequency should not be deleted)
  • maybe adding some kind of jitter to regular updates would be nice / polite

A unified way of updating feeds would be nice as well, e.g. "just call this every minute". Currently, I'm running the following in a cron:

  • every hour: update
  • every minute update --new-only; search update

lemon24 avatar Mar 18 '24 07:03 lemon24

Logic is easy enough (turning update_after etc. to/from seconds not shown):

import random
from dataclasses import dataclass

@dataclass
class Feed:
    url: str
    update_period: int
    update_after: int = 0
    # last_updated is only set when the feed is actually updated
    # (not when it's not modified, not when there was an exception)
    # https://github.com/lemon24/reader/blob/3.12/src/reader/_update.py#L276
    # https://github.com/lemon24/reader/blob/3.12/src/reader/_update.py#L445
    last_retrieved: int = 0

def get_feeds_for_update(feeds, now):
    return [f for f in feeds if f.update_after <= now]

def next_period(feed, now, jitter_ratio=0):
    jitter = random.random() * jitter_ratio
    current_period_no = now // feed.update_period
    return (current_period_no + 1 + jitter) * feed.update_period

def update_feeds(feeds, now, get_update_after=next_period):
    to_update = get_feeds_for_update(feeds, now)
    for feed in to_update:
        feed.last_retrieved = now
        feed.update_after = get_update_after(feed, now)
    return to_update

def set_update_period(feed, update_period):
    feed.update_period = update_period
    feed.update_after = next_period(feed, feed.last_retrieved)
Tests:
from collections import Counter
from functools import partial
import pytest


@pytest.mark.parametrize('old_after, new_after, now', [
    (0, 10, 0),
    (0, 10, 1),
    (0, 10, 9.999),
    (0, 20, 10),
    (5, 10, 5),
    (10, 20, 10),
    (105, 110, 109),
    (105, 120, 110),
    (105, 200, 199.999),
    (105, 210, 200),
    
])
def test_update(old_after, new_after, now):
    feeds = [Feed('one', 10, old_after)]
    assert len(update_feeds(feeds, now)) == 1
    assert feeds == [Feed('one', 10, new_after, now)]    


@pytest.mark.parametrize('old_after, now', [
    (5, 4),
    (10, 9.999),
    (20, 19),
])
def test_no_update(old_after, now):
    feeds = [Feed('one', 10, old_after)]
    assert len(update_feeds(feeds, now)) == 0
    assert feeds == [Feed('one', 10, old_after)]    


@pytest.mark.parametrize('get_update_after', [
    next_period, 
    # jitter ratio less than 10-1, to account for time step
    partial(next_period, jitter_ratio=.9),
])
def test_sweep(get_update_after):
    feeds = [Feed('one', 10), Feed('two', 20), Feed('three', 100)]
    
    counts = Counter()
    for now in range(100):
        for feed in update_feeds(feeds, now, get_update_after):
            counts[feed.url] += 1
            
    assert counts == {'one': 10, 'two': 5, 'three': 1}


def test_set_period_up():
    feeds = [Feed('one', 10)]
    update_feeds(feeds, 5)
    
    set_update_period(feeds[0], 20)
    
    # no update needed, already updated in this period
    assert len(update_feeds(feeds, 15)) == 0


def test_set_period_down():
    feeds = [Feed('one', 20)]
    update_feeds(feeds, 5)
    
    set_update_period(feeds[0], 10)
    
    # update needed, if taking new period into account
    assert len(update_feeds(feeds, 15)) == 1
    

On to the API!

Update: We can get rid of last_retrieve and rely on the current time in set_update_period(); all tests pass with minimal changes.

lemon24 avatar Mar 31 '24 07:03 lemon24

API

Add update_feeds(scheduled: bool | None = None) argument that filters feeds to update:

  • if none: if global tag .reader.update is set (regardless of value), assume true, else assume false
  • if true: update only update_after < now
  • if false: update everything (no update_after filtering)

In reader 4.0 (#291), we can make scheduled default to True (or just change the default behavior).


To configure the update interval, we can use a .reader.update tag:

  • can be set globally or per feed, feed overrides global overrides default
  • format: interval: int (seconds), jitter: float|bool (in [0,1])
    • for interval, we could also use cron expressions, or at least @hourly, @daily, @weekly, @monthly values
  • default: {interval: 3600, jitter: 0.25}

Using tags is good because it allows configuring stuff without additional UI.

Possible cons (WIP):

  • We need to find a way to reset update_after when the update interval changes (a dedicated set_feed_update_period() method can take care of this out of the box). One solution may be being able to register hooks on tag updates.
  • We want to validate the config. How do we handle validation errors? Ignore / use default, or raise to the user? The latter may be accomplished with a hook, although a standard validation mechanism would be better.

In the (low level) storage API:

  • add update_after and last_retrieved to FeedUpdateIntent and FeedForUpdate
  • get_feeds_for_update(after: datetime | None) (also returns nulls)
    • new uses last_updated, it should use last_retrieved
  • set_feed_update_after() (used when setting the interval)

lemon24 avatar Apr 08 '24 20:04 lemon24

To do (minimal):

  • [x] add update_after and last_retrieved on Feed, FeedUpdateIntent, and FeedForUpdate
  • [x] get_feeds_for_update(after)
    • [x] (new should use last_retrieved)
  • [x] update_feeds(scheduled) / get_feeds(scheduled) / get_feed_counts(scheduled)
  • [x] update logic (just ignore invalid tag values and use default)
  • [x] expose scheduled in the CLI
  • [x] make the cli_status plugin entries make sense in a --scheduled world
  • [x] tests
  • documentation
    • [x] docstrings
    • [x] user guide
    • [x] changelog, dev notes

Later:

  • reset update_after after interval changes / set_feed_update_after()
  • scheduled=None (default to false initially)
  • config validation

lemon24 avatar May 04 '24 10:05 lemon24

update_after and last_retrieved should go on FeedUpdateIntent, and in turn FeedUpdateIntent should have union of FeedData-with-extra-stuff or exception or None, but I'm having trouble finding a good name for FeedData-with-extra-stuff.

For reference, here's all the feed-related data classes and how they're used:

        .-- FeedForUpdate ---.
        v                    |
     parser                  |
        |                    | 
    ParsedFeed            storage -. 
  (has FeedData)             ^     |
        v                    |     |
     updater                 |     |
        |                    |     |
        |- FeedUpdateIntent -'    Feed
        | (has ??? (has FeedData)  |
        |   or ExceptionInfo       |
        |   or None)               |
        |                          v
        '---- UpdateResult -----> user
            (has UpdatedFeed)

lemon24 avatar May 05 '24 10:05 lemon24