
[Feature] Per-feed and per-category minimum polling delays

Open K4LCIFER opened this issue 1 year ago • 14 comments

Is your feature request related to a problem? Please describe. Currently, all feeds refresh at the same time — that is, when a refresh is triggered, all feeds are quickly queried sequentially. This can lead to rate limiting issues. I have seen these rate limits with YouTube and Spotify so far.

Describe the solution you'd like I propose the ability to set minimum polling delays for feeds. There are a couple of types:

  • Per-feed: Before the feed is polled, a delay occurs.
    • This allows the user to specify a delay for a specific feed that may be within a category that doesn't need a delay between all feeds.
  • Per-category: Set a delay for the group so that there is a delay that occurs between every feed in the group.
    • For example: say one has a collection of YouTube feeds in one category. Setting this minimum delay would require a delay to occur between each YouTube feed in the category.

Describe alternatives you’ve considered None.

Additional context None.

K4LCIFER avatar Jun 13 '24 07:06 K4LCIFER

One option at the moment is to make a FreshRSS extension, which listens to feed_before_actualize with a sleep(10) or something like that
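As a rough, language-neutral sketch of that hook-and-sleep idea (the real extension would be PHP; the hook name comes from the suggestion above, while the registry, callback, and delay value here are hypothetical):

```python
import time

# Hypothetical sketch: register a callback on a "feed_before_actualize"-style
# hook that pauses before every feed fetch. The real FreshRSS extension
# would be written in PHP against the actual extension API.

class HookRegistry:
    def __init__(self):
        self._hooks = {}

    def register(self, name, callback):
        self._hooks.setdefault(name, []).append(callback)

    def fire(self, name, *args):
        for callback in self._hooks.get(name, []):
            callback(*args)

hooks = HookRegistry()
DELAY_SECONDS = 0.01  # the comment suggests sleep(10) in production

def throttle_before_fetch(feed_url):
    # Crude global throttle: pause before every feed fetch.
    time.sleep(DELAY_SECONDS)

hooks.register("feed_before_actualize", throttle_before_fetch)

start = time.monotonic()
for url in ["https://a.example/feed", "https://b.example/feed"]:
    hooks.fire("feed_before_actualize", url)  # fetch would follow here
elapsed = time.monotonic() - start
```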

Alkarex avatar Jun 13 '24 08:06 Alkarex

Here's an extension to handle this: https://github.com/pe1uca/xExtension-RateLimiter/tree/pre-1.25
I have it working on my personal instance for a site with a limit of 100 requests every 10 minutes.

The linked branch is the one that should be used with the latest release; the main branch needs #7007 merged and 1.25 released.

pe1uca avatar Nov 20 '24 23:11 pe1uca

Merged in edge

Alkarex avatar Nov 21 '24 07:11 Alkarex

Refresh rate limiting is quite an important feature for an RSS reader, and I think this issue is worth further consideration.

Background

Currently, FreshRSS only allows users to apply a minimum interval limit globally or per feed, using the "Do not automatically refresh more often than" setting with several relatively large time periods as options (15 min, 30 min, 1 h), which I think is not enough for actual use cases.

Issues

Rate Limit on Groups of Feed

In most cases, when we talk about a "refresh rate limit", we actually want to limit the refresh rate of a group of feeds towards a specific "source". Most of the time the "source" turns out to be a certain host, but this is not always the case.

Case 1

A user subscribes to several YouTube video feeds across different categories.

In this case, the "source" is youtube.com, and the problem could be addressed by applying a rate limit to this host. This case also illustrates why it's not a very good idea to use "Category" as the rate limit group (applying a refresh rate limit to a certain category), since feeds in FreshRSS categories are more likely to be organized by content type (e.g. News, Entertainment) rather than by source (e.g. YouTube, Reddit).

Case 2

A user is using an RSS converting service (RSSHub, RSS-Bridge, etc.) to subscribe to several feeds converted from two different sources, exampleA.com and exampleB.com. Assume the converting service runs on convertRSS.com.

In this case, all feeds sourced from exampleA.com and exampleB.com (or from many other hosts, if the user uses the converting service to subscribe to other sources) will share the same subscription URL host, convertRSS.com.

In this case the rate limit should obviously not be applied directly to the host convertRSS.com. Instead, we need a way to apply a rate limit to a group of feeds that actually point to the same source, for example all RSS feeds converted from exampleA.com.


If I'm not misunderstanding the FreshRSS docs, the "Do not automatically refresh more often than" setting in FreshRSS cannot be used to solve the issues in the cases above.

Lower Request Density

In my personal opinion, being able to limit the request frequency over a relatively short period of time is also useful in many cases; consider the following examples:


Request graph to a certain "source":
x: Request Sent
-: No operation

Long-term Rate Limit

Before: x-x-x------------------x-x-x------------------> Timeline
 After: x-x-x-----------------------------x-x-x-------> Timeline
  • Lowers the total number of requests over the long term.
  • No effect on request density.

Short-term Rate Limit

Before: x-x-x------------------x-x-x------------------> Timeline
 After: x----x----x------------x----x----x------------> Timeline
  • No effect on the total number of requests.
  • Lowers the request density.
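The short-term limit sketched above amounts to enforcing a minimum gap between consecutive requests to one source, so a burst gets spread out over the timeline instead of being sent all at once. A minimal sketch, with hypothetical names and a deliberately small interval for illustration:

```python
import time

# Hypothetical sketch of a "short-term" limiter: enforce a minimum interval
# between consecutive requests to the same source, spreading bursts out.

class MinIntervalLimiter:
    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_request = None

    def wait(self):
        # Block until at least min_interval has passed since the last call.
        now = time.monotonic()
        if self._last_request is not None:
            remaining = self.min_interval - (now - self._last_request)
            if remaining > 0:
                time.sleep(remaining)
        self._last_request = time.monotonic()

limiter = MinIntervalLimiter(min_interval=0.05)
start = time.monotonic()
for _ in range(3):
    limiter.wait()  # the actual HTTP request would follow each wait
elapsed = time.monotonic() - start
# Three requests with a 0.05 s floor between them take at least ~0.1 s.
```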

The "Do not automatically refresh more often than" setting in FreshRSS can be used to achieve a long-term rate limit; however:

  • It limits each feed individually, not a group of feeds sharing the same "source".
  • It still cannot achieve the "Short-term Rate Limit" described in the example above.

Being unable to control the request density can result in a burst of requests towards a certain source website in a short period of time, for example more than 20 requests to a single source in less than one minute. This behavior may trigger the rate limit policy of the source website and result in refresh failures.

A Possible Solution

I'm not sure if it's a good idea, but I think a new Rate Limit Group mechanism could address the issues mentioned above.

Rate Limit Group

Feeds can be considered the basic unit in the rate limit system, since they are the basic unit of refreshing. We could allow users to create and manage several Rate Limit Groups, each containing several feeds.

A Rate Limit Group represents a group of feeds that should be considered as coming from a single "source"; thus, the relation between Feeds and Rate Limit Groups should be many-to-one (one feed should not be considered to have multiple "sources").

Optionally, all feeds not assigned to any Rate Limit Group could be placed in a default group.
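The many-to-one Feed-to-group relation described above might be modelled like this (all names, intervals, and URLs are hypothetical, not actual FreshRSS entities):

```python
from dataclasses import dataclass

# Hypothetical data model for the proposal above: each feed belongs to
# exactly one Rate Limit Group; several feeds may share one group.

@dataclass
class RateLimitGroup:
    name: str
    min_refresh_interval_s: int   # long-term limit
    min_request_gap_s: float      # short-term limit

@dataclass
class Feed:
    url: str
    group: RateLimitGroup         # exactly one group per feed

DEFAULT_GROUP = RateLimitGroup("default", 3600, 0.0)
youtube = RateLimitGroup("youtube.com", 1800, 10.0)

feeds = [
    Feed("https://youtube.com/feeds/videos.xml?channel_id=A", youtube),
    Feed("https://youtube.com/feeds/videos.xml?channel_id=B", youtube),
    Feed("https://example.org/rss", DEFAULT_GROUP),
]

# Many-to-one: several feeds share one group; each feed has exactly one.
by_group = {}
for feed in feeds:
    by_group.setdefault(feed.group.name, []).append(feed.url)
```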

Long-term Rate Limit

A setting like "Minimum Auto-refresh Interval" should be sufficient; updating it would effectively update the default "Do not automatically refresh more often than" option for all feeds inside the group. (Or perhaps "Do not automatically refresh more often than" could be renamed to "Minimum Auto-refresh Interval", which I personally find more understandable.)

Short-term Rate Limit

To achieve such a limit, we could enforce a minimum interval between two refresh requests of feeds in the same Rate Limit Group.

When Rate Limit Exceeded

For the long-term limit, it seems acceptable to simply ignore all refresh requests that do not satisfy the "Minimum Auto-refresh Interval" limit. (This needs more consideration; whether ignoring these requests is acceptable may depend on the concrete implementation details of the rate limit system.)

However, for the short-term limit, it doesn't make sense to ignore requests just because the minimum short-term interval is not satisfied; even a slightly large interval (for example 30 s) could result in many requests being ignored. If possible, all refresh requests of a single Rate Limit Group should be put into a queue and executed in order, with the short-term rate limit as the interval between requests.

Multi-user Support

I'm not clear on how FreshRSS handles feeds from different users, or what happens if two users subscribe to the same feed. However, the approach above should work as long as FreshRSS guarantees that a feed with the same id cannot fall into two different groups created by the same user. (This also assumes that refresh operations are isolated between users: each user can perform a refresh, and a refresh triggered by a certain user only refreshes the feeds related to that user.)

Refresh Process

One possible refresh process of a user could be:

  1. Find all Rate Limit Groups of this user.
  2. For each group:
    1. Get all feeds in this group.
    2. Filter out those that don't satisfy the long-term limit.
    3. Refresh the remaining feeds one by one, waiting the short-term limit interval between them.

Corresponding pseudocode:

import asyncio
import random

MAX_CONCURRENT_GROUP_REFRESHING_TASK = 4  # cap on concurrent group tasks

async def refresh_all_users_feeds():
	for user in get_all_users():
		# Users are refreshed one after another, NOT concurrently, so two
		# groups from different users never hit the same source at once.
		await refresh_user_feeds(user)

async def refresh_user_feeds(user):
	# The semaphore caps how many group refresh tasks run at the same time.
	semaphore = asyncio.Semaphore(MAX_CONCURRENT_GROUP_REFRESHING_TASK)

	async def run_group(group):
		async with semaphore:  # wait if the concurrency limit is reached
			await group_refreshing_sub_task(group)

	groups = list(get_groups_of_user(user))
	random.shuffle(groups)  # random order lowers same-source collisions
	# gather() ensures all group refresh tasks started by this user have
	# ended before the next user's refresh begins.
	await asyncio.gather(*(run_group(group) for group in groups))

async def group_refreshing_sub_task(group):
	feeds = list(group.feeds)
	random.shuffle(feeds)
	for feed in feeds:
		if not feed.check_long_term_limit():
			continue  # refreshed too recently; skip this round
		await asyncio.sleep(group.short_term_limit)  # short-term spacing
		refresh(feed)  # actually sends requests, also takes care of hooks

def refresh(feed):
	...
	# send the request, refreshing the feed
	# take care of all kinds of hooks
	# update the long-term limit timestamp (if there is one)
	# other necessary work

[Figure: process example diagram]

I haven't learned PHP, and I'm not sure whether this design is a good idea or even feasible to implement in PHP; maybe there's a better way to design the refresh process when using PHP.


Concurrent Task

  • The process group_refreshing_sub_task() should be run concurrently, since there is no problem sending multiple requests at the same time if they all come from different "sources".
  • There should be a mechanism controlling the maximum number of concurrently running group_refreshing_sub_task() tasks, since we don't want too many ongoing requests at the same time. (This is what the Semaphore in the pseudocode is for.)

The following are suggested but not strictly necessary requirements.

  • The process refresh_user_feeds() should not be run concurrently, because two different groups from different users could point to the same "source".
  • For the same reason, when doing a full-site auto refresh, we may want to make sure all group refreshing tasks of the current user have ended before starting the refresh process of the next user.

Random Groups/Feeds Order

Consider MAX_CONCURRENT_GROUP_REFRESHING_TASK to be n. In the worst case, all currently running group refreshing tasks are requesting feeds from the same source, which means FreshRSS will have n concurrent ongoing requests to that source. (The probability of this case increases if refresh_user_feeds() is allowed to run concurrently, the chosen users each have only a single default Rate Limit Group, and the first feed in those groups is the same, e.g. the default pre-subscribed FreshRSS feed on a fresh deployment.)

Optionally, to lower the probability of the edge case mentioned above, we could process groups and feeds in random order during refreshing.


After all, this is just my personal opinion and might not be practical for production. However, I believe implementing a more effective rate-limiting system could significantly enhance FreshRSS usability and improve the overall user experience.

nfnfgo avatar Jan 10 '25 16:01 nfnfgo

But doesn't the extension in https://github.com/FreshRSS/FreshRSS/issues/6556#issuecomment-2489777453 already implement that rate limit group (automatically based on the domain iirc)?

Frenzie avatar Jan 10 '25 17:01 Frenzie

Yes, the extension you mentioned will apply a rate limit per host, but:

  • The grouping criterion is fixed to the host of the feed's link, which makes it unable to address the issue in Rate Limit on Groups of Feed - Case 2 in my comment above (when using an RSS converting service).
  • It seems this extension is still unable to apply the short-term rate limit I described in my comment, so there could still be a request burst in a short period of time (e.g. within 1 min) towards the same source.

nfnfgo avatar Jan 10 '25 18:01 nfnfgo

Also, even if a complete group-based rate limit system is not feasible for now, there could at least be a "global request sending interval" that lets the user limit the minimum time between two feeds' refresh requests as a workaround. I still think a rate limit system is necessary, especially when a user subscribes to lots of feeds from one single source; currently neither FreshRSS nor the mentioned extension can prevent a short-time request burst towards a single source.

nfnfgo avatar Jan 11 '25 11:01 nfnfgo

@nfnfgo I don't understand what you mean by unable to apply a short-term rate limit.
The extension implements a way to prevent more than 50 requests to the same domain within the last 5 minutes.
So when FreshRSS tries to update feeds coming from the same domain in a burst, any request beyond the 50th is cancelled by the extension, and those feeds wait until the next time FreshRSS tries to update them.
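That behaviour resembles a sliding-window counter per domain. A rough sketch of the idea (the actual extension is PHP; the class and method names here are hypothetical):

```python
import time
from collections import defaultdict, deque

# Hypothetical sketch: allow at most N requests per domain inside a sliding
# window; requests over the limit are rejected and retried on a later
# refresh cycle, mirroring the extension's described behaviour.

class DomainRateLimiter:
    def __init__(self, max_requests=50, window_s=300):
        self.max_requests = max_requests
        self.window_s = window_s
        self._history = defaultdict(deque)  # domain -> request timestamps

    def allow(self, domain, now=None):
        now = time.monotonic() if now is None else now
        timestamps = self._history[domain]
        # Drop timestamps that have fallen out of the window.
        while timestamps and now - timestamps[0] >= self.window_s:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False  # cancelled; retried on a later refresh run
        timestamps.append(now)
        return True

limiter = DomainRateLimiter(max_requests=50, window_s=300)
# A burst of 60 requests at the same instant: only the first 50 pass.
results = [limiter.allow("youtube.com", now=0.0) for _ in range(60)]
```

Once the window slides past the old timestamps, `allow()` starts returning True again, which matches "wait until the next time FreshRSS tries to update them".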


Rate Limit on Groups of Feed - Case 2

I just tried it with feedburner taken from feeds from here.
It doesn't matter if I try to mix several sources (i.e. feedburner for siteA, feedburner for siteB); the result is a rate limit of around 800. I also tried with an extra one (i.e. feedburner for siteC along with the other two) and the limit is the same, just 800 requests total.

pe1uca avatar Jan 11 '25 22:01 pe1uca

@pe1uca Thanks for your reply!

Minimum Request Interval

About the first question: maybe the term I used, "short-term rate limit", is not clear enough, and a name like "Request Interval Limit" might be clearer. The ultimate goal is to limit the maximum request "density" in a short period of time.

With this extension, a similar effect could be achieved if the interval is set short enough, for example limited to "2 req every 10 sec", but I'm not sure the extension allows setting a limit over such a short time period.

Also, there is another issue when using the extension like this: the extension will directly interrupt and stop any request over the limit. This mechanism could cause issues without a proper refresh order/retry mechanism.

Necessity to Manually Manage Group

The example you gave limits the request rate on an RSS feed (whose content is mixed from several other sources), and I believe that is not quite the same case as I described in my comment.

Consider a user with a self-hosted RSS converter site convert.com. This converter site provides endpoints like:

  • convert.com/twitter/username
  • convert.com/github-issue/issue-link

Such a converter service can be used to generate feeds for any website/webpage that doesn't natively provide RSS support.

Now consider the user adding the following feeds in FreshRSS:

  • convert.com/twitter/user1
  • convert.com/twitter/user2
  • convert.com/twitter/user3
  • convert.com/twitter/...
  • convert.com/github-issue/issue-link1
  • convert.com/github-issue/issue-link2
  • convert.com/github-issue/issue-link3
  • convert.com/github-issue/issue-...

Of course we could limit the overall request rate to host convert.com, for example 500 req/hour, but we have no way to actually limit our refresh rate towards the real sources, here the Twitter user feeds and GitHub issue pages.

In the worst case, 499 requests could be sent to a single actual source per hour.

One workaround is to simply lower the overall rate limit on convert.com to something like 200 or 300. But this is not very feasible in production, mainly because:

  • The extension will directly ignore all requests over the rate limit; if the refresh order were fixed, some feeds late in the refresh order could consistently have a high probability of triggering the rate limit and being ignored.
  • If a user heavily depends on their self-hosted converter service, the number of actual sources could be large (20+, for example), and the number of feeds could also be large (e.g. 100+ feeds all on the convert.com host). In that case, a single overall limit on convert.com is meaningless: if all feeds due for refresh come from different actual sources (Twitter, GitHub, YouTube), there is no need to rate limit them even though they all share the host convert.com.

nfnfgo avatar Jan 12 '25 00:01 nfnfgo

Ah, I think I get it: you still want only 50 requests every 5 minutes, but you don't want to "spend" the requests immediately, so you spread them out at 2 requests every 10 seconds, which means you'll "spend" the requests after 250 seconds and then wait 50 seconds for the counter to reset. Something like that, right?

What would be the benefit of spreading the requests over the rate limit window instead of making them as soon as possible?

While developing the extension and testing it on my instance I've seen the rate limit reset while the update process is ongoing, so due to the processing FreshRSS does on the feed's response there's little chance a huge number of requests are going to be made in a very short period of time (at least in my setup with SQLite, not sure if changing the DB would make a difference in here).


Also, there is another issue when using the extension like this: the extension will directly interrupt and stop any request over the limit. This mechanism could cause issues without a proper refresh order/retry mechanism.

Well, that's the whole point of a RSS reader, to constantly refresh the data stored locally with the one provided by the feed and present it in a nice way.

The rate limiter extension uses the same mechanism as AutoTTL (which prevents the feed from being updated until new content is expected to be available), and that extension has been around for 2 years (according to the repo page); I think if there had been any issues with this mechanism they would have already been reported.

Also, the refresh order is not fixed. You can see here the function to update the feeds, which uses this query ordered by last update, so FreshRSS tries to update them from oldest update to newest.
So when AutoTTL or the rate limit prevents a feed from being updated, it's not kicked to the back of the line; it might as well be the first one to be retried.


I think it's the responsibility of the service at convert.com to not trip the rate limits of the sites it's set to track, if a good number of pages is set for each of them.
And it should send a proper response stating the content couldn't be updated, either the same 429 (which would make the extension prevent any further updates) or a 304 (which would allow the extension to keep updating, up to the configured number of overall requests to convert.com).

Probably an improvement on the extension would be to configure different limits for each site, or even to ignore a site so we can hit convert.com as many times as we'd like while still tracking direct requests to other sites with known rate limits.

pe1uca avatar Jan 12 '25 02:01 pe1uca

at least in my setup with SQLite, not sure if changing the DB would make a difference in here

Unlikely, but different hardware or allocations could mean significant speedups depending on what you're currently running on.

Frenzie avatar Jan 12 '25 08:01 Frenzie

Thanks for your reply.

Issues that could be considered addressed

What would be the benefit of spreading the requests over the rate limit window instead of making them as soon as possible?

About this question: I'm just afraid the auto-refresh could cause a burst of requests towards a single source in a short period of time. Even if the total number of requests is the same, spacing them out to lower the density could still be considered gentler. In other words, lower density may allow users to subscribe to more feeds in the same environment compared to a refresh strategy with higher request density. But this is just my personal opinion, and the idea has not been tested/verified.

And now given that what you said below:

so due to the processing FreshRSS does on the feed's response there's little chance a huge number of requests are going to be made in a very short period of time

If this is the case, then maybe it's not necessary (or at least not a high priority) to implement a completely new feature to limit the minimum request interval, if the current mechanism works without problems.

Maybe a global rate limit option could still be provided as insurance (if it's simple to implement): users who think the request density is too high, or who just want to lower it, could set a global minimum request interval like 5-10 s.


Also, the refresh order is not fixed. You can see here the function to update the feeds, which uses this query ordered by last update, so FreshRSS tries to update them from oldest update to newest.

Thank you for telling me this; I hadn't looked into the relevant code, and the official docs don't seem to mention it either. Knowing this, there is no more concern about "some feeds always being ignored" even if the extension ignores the over-limit requests.

My Thoughts on Improving the Extension

Now about the "RSS converter" related issues: I think you are right, and the converter itself should be responsible for controlling the request frequency to the different sources.

Probably an improvement on the extension would be to configure different limits for each site, or even to ignore a site so we can hit convert.com as many times as we'd like while still tracking direct requests to other sites with known rate limits.

I think this is a good idea. Taking into consideration that this improvement requires treating each host separately, that is:

  • Feeds with an identical host share a single limit.
  • The limit configuration of each host can be edited independently.

So to achieve the feature mentioned in the improvement, a separate group configuration system is, in my personal opinion, nearly unavoidable.

But if you decide to do such work, allowing users to configure different limits for different host groups, then in my personal opinion it may also be a good idea not to limit the grouping criterion to the host, and to allow users to use other grouping criteria. This could let the extension provide better functionality without a large increase in the amount of work.

Currently, the extension maps each feed's URL to a host when distributing the request limit:

youtube.com <-- https://youtube.com/...
x.com <-- https://x.com/...

There could be several ways to let users set their own grouping criteria; regex matching is one possible example. We could allow the user to set a regex when configuring a group, and the configuration could look like:

- Group Name: ...
- Grouping Criteria: HOST
- Criteria Value: youtube.com
- Limit: 50 req/h

- Group Name: ...
- Grouping Criteria: Regex
- Criteria Value: converter.com/twitter/.*
- Limit: 50 req/h

Introducing regex matching could give users complete control over how feed URLs are grouped, and with the right regex the RSS-converter issue could be solved simply by writing something like:

^convert\.com/twitter/.*$

However, while the host can be extracted directly from a feed's URL, this method may require loading the group configurations every time the extension handles a refresh request, which could make the extension's processing more complex, so I'm still not sure the idea is feasible. Also, further consideration is needed for feed URLs that match more than one group; maybe the user also needs to set a priority for each group.

To optimize, we could match all feed URLs against the groups and store the result every time the config is updated; when handling a new refresh request, we just use the precomputed result, with no need to match the URL against every group's criteria.
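That precomputation could be sketched like this (group names, patterns, limits, and URLs are all hypothetical; first match wins, standing in for an explicit per-group priority):

```python
import re

# Hypothetical sketch: ordered regex rules assign each feed URL to a rate
# limit group. The mapping is recomputed only when the configuration
# changes, then reused on every refresh request.

GROUP_RULES = [  # ordered: the first matching rule wins (acts as priority)
    ("converted-twitter", re.compile(r"^https?://convert\.com/twitter/.*$"), "50 req/h"),
    ("converted-github",  re.compile(r"^https?://convert\.com/github-issue/.*$"), "50 req/h"),
    ("youtube",           re.compile(r"^https?://(www\.)?youtube\.com/.*$"), "50 req/h"),
]

def assign_group(url):
    for name, pattern, _limit in GROUP_RULES:
        if pattern.match(url):
            return name
    return "default"  # feeds matching no rule fall into a default group

# Recomputed only when the configuration is updated, not per request.
feed_urls = [
    "https://convert.com/twitter/user1",
    "https://convert.com/github-issue/issue-link1",
    "https://youtube.com/feeds/videos.xml?channel_id=A",
    "https://example.org/rss",
]
url_to_group = {url: assign_group(url) for url in feed_urls}
```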

And if a grouping system is provided, a host/group can effectively be left unlimited by setting a very large limit:

Group: convert.com
Limit: 99999req/min

which I think has essentially the same effect as excluding convert.com from the rate limit, so there is no need for a separate "exclude host" feature if a group configuration system is provided.

nfnfgo avatar Jan 12 '25 08:01 nfnfgo

Thank you for telling me this; I hadn't looked into the relevant code, and the official docs don't seem to mention it either. Knowing this, there is no more concern about "some feeds always being ignored" even if the extension ignores the over-limit requests.

Also note that HTTP headers like Expires and Last-Modified influence behavior. If the server sets expires for five minutes from now it can essentially always expect a request (from FreshRSS which implements its own much higher limit per feed I mean; of course it might help with a program that checks every few seconds), and if it doesn't like that it should set expires to for example an hour. Then if last-modified is the same it will update the new expires without asking for the full feed.

In principle that should result in a fair bit of spread as well, provided the server returns halfway sane values.

Frenzie avatar Jan 12 '25 09:01 Frenzie

I just want an easy way to set the update value for every feed in a group. I am not looking for the more complex cases, just an easy way to set the value in bulk instead of opening each feed and setting its update value individually.

pcause avatar Apr 14 '25 11:04 pcause

Related: Tests and feedback welcome:

  • https://github.com/FreshRSS/FreshRSS/pull/7760

Alkarex avatar Jul 27 '25 14:07 Alkarex