ripme
[Proposal] Add a globally-applicable table for rate limiting rules on URLs
For link aggregators like Reddit, the ripper shouldn't have to know about every URL that could show up. Ripme as a whole, however, broadly does know about them.
It might be a good idea to have a single table of rate limiting rules for URLs that applies globally to all rippers, so that rate limiting isn't piecemeal in each ripper's logic. Under such a rule, we would not rip a new URL for a given domain until that domain's rate limit time has expired.
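To make the idea concrete, here is a minimal sketch of what such a shared table could look like. Everything in it is hypothetical (the class name `RateLimitTable`, its methods, and the parent-domain fallback are illustrative assumptions, not existing Ripme code); it only shows one table that any ripper could consult for a URL's per-domain delay.

```java
import java.net.URI;
import java.time.Duration;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not existing Ripme code:
// one table of per-domain delays shared by every ripper.
public class RateLimitTable {
    private final Map<String, Duration> delays = new ConcurrentHashMap<>();
    private final Duration defaultDelay;

    public RateLimitTable(Duration defaultDelay) {
        this.defaultDelay = defaultDelay;
    }

    // Register the minimum time between rips for a domain.
    public void setDelay(String domain, Duration delay) {
        delays.put(domain.toLowerCase(Locale.ROOT), delay);
    }

    // Look up the delay for a URL. Tries the full host first, then each
    // parent domain (i.imgur.com -> imgur.com -> com), then the default.
    public Duration delayFor(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return defaultDelay;
        }
        host = host.toLowerCase(Locale.ROOT);
        while (true) {
            Duration delay = delays.get(host);
            if (delay != null) {
                return delay;
            }
            int dot = host.indexOf('.');
            if (dot < 0) {
                return defaultDelay;
            }
            host = host.substring(dot + 1);
        }
    }
}
```

With this shape, a Reddit ripper that runs into an imgur link would not need its own imgur rule; it would just ask the table for the delay that applies to that URL.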
Broadly speaking, we don't have any real rate limiting system in place. What rate limiting exists is ad hoc: it simply delays when the link gets added to the queue, which isn't the same as delaying after a download completes. Consider a "rate limit" delay of 3 seconds before each link is added to the download queue when downloads take 10 seconds each. Links pile up in the queue faster than they are downloaded, so almost immediately there is no real rate limiting in place anymore.
Also, consider the update scenario. Self-rate-limiting shouldn't be necessary if we check a URL, discover that we already have it, and don't actually rip it. So, we should have a mechanism in place to apply the rate limiting rules only after we successfully rip something.
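A rough sketch of that mechanism follows. The names here (`DomainRateLimiter`, `millisUntilAllowed`, `recordSuccessfulRip`) are hypothetical and do not exist in Ripme, and the clock is passed in only so the behavior is easy to test. The point is that the timer for a domain starts when a rip actually finishes, and that a URL we skip because we already have it never touches the timer.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical sketch, not existing Ripme code: the per-domain timer starts
// only when a rip completes successfully, not when a URL is queued or skipped.
public class DomainRateLimiter {
    private final Map<String, Long> delayMillis = new ConcurrentHashMap<>();
    private final Map<String, Long> lastRipFinished = new ConcurrentHashMap<>();
    private final LongSupplier clock; // e.g. System::currentTimeMillis

    public DomainRateLimiter(LongSupplier clock) {
        this.clock = clock;
    }

    public void setDelay(String domain, long millis) {
        delayMillis.put(domain, millis);
    }

    // How long a caller should still wait before ripping from this domain.
    // Returns 0 if nothing has been ripped from the domain yet.
    public long millisUntilAllowed(String domain) {
        Long last = lastRipFinished.get(domain);
        if (last == null) {
            return 0L;
        }
        long delay = delayMillis.getOrDefault(domain, 0L);
        long remaining = last + delay - clock.getAsLong();
        return Math.max(0L, remaining);
    }

    // Call this only after a download has actually completed. URLs that were
    // checked and skipped (already downloaded) should not call it.
    public void recordSuccessfulRip(String domain) {
        lastRipFinished.put(domain, clock.getAsLong());
    }
}
```

In the update scenario, a ripper would call `millisUntilAllowed` before starting a download and `recordSuccessfulRip` after it finishes. A URL that turns out to be already downloaded skips both the download and the `recordSuccessfulRip` call, so it adds no delay for the next URL on that domain.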