browsertrix-crawler
Ability to change scoping rules on the fly
Consider an API (via the web server) that could alter the scoping rules mid-crawl, for example to add additional exclusion rules and filter down the existing queue. This would be especially useful when a crawler trap is detected and you just want to finish the crawl. The saved config file could then reflect the updated rules.
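Internally, such an API call could boil down to compiling the new exclusion patterns and dropping matching URLs from the pending queue. A minimal sketch of that filtering step (the function name, queue shape, and pattern format are illustrative assumptions, not actual browsertrix-crawler internals, where the frontier lives in Redis):

```python
import re

def apply_exclusions(queue, exclusion_patterns):
    """Drop queued URLs matching any of the newly added exclusion regexes.

    `queue` is assumed to be a plain list of URL strings for illustration.
    """
    compiled = [re.compile(p) for p in exclusion_patterns]
    return [url for url in queue if not any(rx.search(url) for rx in compiled)]

queue = [
    "https://example.com/articles/1",
    "https://example.com/calendar?month=1&year=2023",
    "https://example.com/calendar?month=2&year=2023",
]
# Exclude the calendar pages that form the trap:
print(apply_exclusions(queue, [r"/calendar"]))
```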
Maybe a quick diff on the URLs that shows a Levenshtein distance of less than 2 or some such could then trigger another diff of the captured pages. If the difference between the pages is minimal, that means we're just generating pages that are identical except for the link that triggered the crawler trap. Just thinking out loud.
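The URL comparison described above can be done with a plain dynamic-programming Levenshtein distance; a self-contained sketch (the threshold of 2 is the value suggested above, the example URLs are hypothetical):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Two URLs differing only in a counter are likely trap candidates:
u1 = "https://example.com/page?id=1"
u2 = "https://example.com/page?id=2"
if levenshtein(u1, u2) < 2:
    print("near-duplicate URLs; compare the captured pages next")
```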
For anyone else reading this issue: if it is not clear to you yet, you can already do this manually by altering the saved config file and restarting from there. Of course this is not the same as an API, but I have used it with great success to speed a crawl up by excluding unnecessary URLs :)
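As a concrete illustration, adding an exclusion regex to the saved YAML config before restarting might look like the fragment below; the exact key names and scope values should be checked against the browsertrix-crawler docs for your version, and the URL is hypothetical:

```yaml
seeds:
  - url: https://example.com/
    scopeType: prefix

# Added after noticing the trap, before restarting the crawl:
exclude:
  - ^https://example\.com/calendar
```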
At least I was misled by the wording and assumed this was not possible at the moment, but it is. Maybe change the title to include "via API" or "without stopping"?