browsertrix-crawler
Ability to change scoping rules on the fly
Consider an API (via the web server) that could alter the scoping rules mid-crawl, for example to add additional exclusion rules and filter down the existing queue. This would be especially useful when a crawler trap is detected and you just want to finish the crawl. The saved config file could then reflect the updated rules.
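Internally, such an API call could boil down to compiling the new exclusion patterns and dropping matching URLs from the pending queue. A minimal sketch of that filtering step (the function name, queue shape, and pattern format are illustrative assumptions, not actual browsertrix-crawler internals, where the frontier lives in Redis):

```python
import re

def apply_exclusions(queue, exclusion_patterns):
    """Drop queued URLs matching any of the newly added exclusion regexes.

    `queue` is assumed to be a plain list of URL strings for illustration.
    """
    compiled = [re.compile(p) for p in exclusion_patterns]
    return [url for url in queue if not any(rx.search(url) for rx in compiled)]

queue = [
    "https://example.com/articles/1",
    "https://example.com/calendar?month=1&year=2023",
    "https://example.com/calendar?month=2&year=2023",
]
# Exclude the calendar pages that form the trap:
print(apply_exclusions(queue, [r"/calendar"]))
```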
Maybe a quick diff on the URLs that shows a Levenshtein distance of less than 2 or some such could then trigger another diff of the captured pages. If the difference between the pages is minimal, that means we're just generating pages that are identical except for the link that triggered the crawler trap. Just thinking out loud.
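The URL comparison described above can be done with a plain dynamic-programming Levenshtein distance; a self-contained sketch (the threshold of 2 is the value suggested above, the example URLs are hypothetical):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic two-row dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Two URLs differing only in a counter are likely trap candidates:
u1 = "https://example.com/page?id=1"
u2 = "https://example.com/page?id=2"
if levenshtein(u1, u2) < 2:
    print("near-duplicate URLs; compare the captured pages next")
```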
For anyone else reading this issue: if it is not clear to you yet, you can already do this manually by altering the saved config file and restarting from there. Of course this is not the same as an API, but I have used it with great success to speed a crawl up by excluding unnecessary URLs :)
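As a concrete illustration, adding an exclusion regex to the saved YAML config before restarting might look like the fragment below; the exact key names and scope values should be checked against the browsertrix-crawler docs for your version, and the URL is hypothetical:

```yaml
seeds:
  - url: https://example.com/
    scopeType: prefix

# Added after noticing the trap, before restarting the crawl:
exclude:
  - ^https://example\.com/calendar
```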
At least I was misled by the wording and assumed this was not possible at the moment, but it is. Maybe change the title to include "via API" or "without stopping"?