browsertrix-crawler
Automatically add exclusion rules based on `robots.txt`
It would be nice if the crawler could automatically fetch a site's `robots.txt` and add an exclusion rule for every rule present in that file.
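
For illustration, here is a minimal sketch of how `Disallow` rules from `robots.txt` could be translated into the kind of URL regexes the crawler already accepts via `--exclude`. This is not the crawler's actual code: the function name `robotsToExclusions` and the parsing approach are hypothetical, and the sketch only honors the `User-agent: *` group.

```ts
// Hypothetical sketch: fetch a site's robots.txt and turn its Disallow
// rules into exclusion regexes that could be passed to --exclude.
async function robotsToExclusions(siteUrl: string): Promise<string[]> {
  const robotsUrl = new URL("/robots.txt", siteUrl).href;
  const resp = await fetch(robotsUrl);
  if (!resp.ok) {
    return []; // no readable robots.txt: nothing to exclude
  }
  const text = await resp.text();

  const exclusions: string[] = [];
  let appliesToUs = false;

  for (const rawLine of text.split("\n")) {
    const line = rawLine.split("#")[0].trim(); // strip comments
    if (!line) continue;

    const [field, ...rest] = line.split(":");
    const value = rest.join(":").trim();

    switch (field.toLowerCase()) {
      case "user-agent":
        // Only honor rules addressed to all agents in this sketch
        appliesToUs = value === "*";
        break;
      case "disallow":
        if (appliesToUs && value) {
          // Escape regex metacharacters so the path is matched literally
          exclusions.push(value.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));
        }
        break;
    }
  }
  return exclusions;
}

// Example: print the exclusion regexes as --exclude arguments
robotsToExclusions("https://example.com/").then((rules) =>
  rules.forEach((r) => console.log(`--exclude "${r}"`)),
);
```
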
I think this functionality should even be turned on by default, to avoid annoying servers that have clearly expressed what they do not want "external systems" to touch.
At Kiwix, we have lots of non-technical users configuring zimit to run a browsertrix crawl. In most cases, they have no idea what a `robots.txt` is, so having the switch turned on by default would help a lot. That being said, I don't mind if it is off by default; we can do the magic to turn it on by default in zimit ^^