Filter out requests when popping from request storage

Open tanguilp opened this issue 3 years ago • 0 comments

Requests are filtered before being added to the request storage, so as to discard irrelevant pages.

When crawling large sites, some filtering rules may be added after crawling has started. This usually involves updating the filters and then updating the spider (by restarting it or by hot code reloading), assuming a persistent storage backend is in use.

In this case, some requests saved before the rules were updated will be crawled anyway. It would be nice to find a way to filter them out when popping as well.
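To make the request concrete, here is a minimal, library-neutral sketch in Python of what filtering on pop could look like. It does not use Crawly's API: `FilteredRequestQueue`, `set_filters`, and the URL-predicate filters are hypothetical names chosen for illustration, and the in-memory deque only stands in for a persistent storage backend.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Optional

# Hypothetical filter type: returns True to keep a request URL.
RequestFilter = Callable[[str], bool]


class FilteredRequestQueue:
    """Toy request storage that applies filters on both store and pop."""

    def __init__(self, filters: Iterable[RequestFilter] = ()) -> None:
        self._queue: Deque[str] = deque()
        self._filters = list(filters)

    def set_filters(self, filters: Iterable[RequestFilter]) -> None:
        # Simulates a rule update made after the crawl has started.
        self._filters = list(filters)

    def _allowed(self, url: str) -> bool:
        return all(f(url) for f in self._filters)

    def store(self, url: str) -> bool:
        # Current behaviour: filter before adding to the storage.
        if not self._allowed(url):
            return False
        self._queue.append(url)
        return True

    def pop(self) -> Optional[str]:
        # Proposed behaviour: re-check the current filters when popping,
        # skipping requests that were stored under older rules.
        while self._queue:
            url = self._queue.popleft()
            if self._allowed(url):
                return url
        return None
```

With this shape, a request stored before a rule such as `lambda url: "/tags/" not in url` was added would be silently dropped by `pop` instead of being crawled. The trade-off is that every filter runs twice per request, so filters that are expensive or stateful (deduplication, for example) might need to be excluded from the pop-side check.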

tanguilp, Dec 09 '20 14:12