
Automatically add exclusion rules based on `robots.txt`

benoit74 opened this issue

It would be nice if the crawler could automatically fetch each site's robots.txt and add an exclusion rule for every rule present in that file.

I think this functionality should even be turned on by default, to avoid annoying servers which have clearly expressed what they do not want "external systems" to mess with.

At Kiwix, we have lots of non-tech users configuring zimit to do a browsertrix crawl. In most cases, they have no idea what a robots.txt is, so having the switch turned on by default would help a lot. That being said, I don't mind if it is off by default; we can do the magic to turn it on by default in zimit ^^

benoit74 avatar Jun 27 '24 07:06 benoit74

Despite its name, robots.txt's purpose is to prevent (well, just give directions to, actually) indexing robots from exploring resources. browsertrix-crawler is a technical bot, but it acts as a user, and certainly not as an indexing bot.

I don't see value in such a feature, but I can imagine there are scenarios where it can be useful. @benoit74, do you have one to share?

Without further information, I'd advise against having this (not yet existent) feature on by default, as it changes the crawler's behavior, while I think this project relies on explicit flags for that.

rgaudin avatar Jun 27 '24 09:06 rgaudin

The first use case is https://forums.gentoo.org/robots.txt, where the robots.txt content indicates fairly accurately what we should exclude from a crawl of the https://forums.gentoo.org/ website:

```
Disallow: /cgi-bin/
Disallow: /search.php
Disallow: /admin/
Disallow: /memberlist.php
Disallow: /groupcp.php
Disallow: /statistics.php
Disallow: /profile.php
Disallow: /privmsg.php
Disallow: /login.php
Disallow: /posting.php
```

The idea behind automatically using robots.txt is to help lazy / not-so-knowledgeable users get a first version of a WARC/ZIM which is likely to contain only useful content, rather than wasting time and resources (ours and the upstream server's) building a WARC/ZIM with too many unneeded pages.

Currently in self-service mode, users tend to simply input the URL https://forums.gentoo.org/ and say "Zimit!". And this is true for "young" Kiwix editors as well.

After that initial run, it might prove interesting in this case to still include /profile.php (user profiles) in the crawl. At the very least, such a choice probably needs to be discussed by humans. But this kind of refinement can be done in a second step, once we realize something is missing.

If we do not automate anything here, it means the self-service approach is mostly doomed to produce only bad archives, which is a bit sad.

benoit74 avatar Jun 27 '24 11:06 benoit74

This confirms that it can be useful in zimit, via an option (that you'd turn on).

rgaudin avatar Jun 27 '24 12:06 rgaudin

We're definitely aware of robots.txt and generally haven't used it, as it may be too restrictive for browser-based archiving. However, robots.txt may provide a hint for paths to exclude, as you suggest. The idea would be to gather all of the specific Disallow rules while ignoring something like `Disallow: /`. Of course, some of the robots rules are URL-specific, but they could also apply to in-page block rules as well.

An interesting idea - we could extend the sitemap support, which already parses robots.txt (https://github.com/webrecorder/browsertrix-crawler/blob/main/src/util/sitemapper.ts#L209), and simply parse all of the Disallow and Allow rules to create exclusions and inclusions. Not quite sure how to handle different user agents - perhaps grabbing rules from all of them, or from a specific one?
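For illustration, a minimal sketch of that idea: parsing Disallow/Allow rules into exclusion and inclusion patterns, assuming a single user-agent group and treating robots.txt wildcards as literals. The function names here are hypothetical, not part of browsertrix-crawler:

```ts
// Illustrative sketch only: turn robots.txt Disallow/Allow rules into
// exclusion/inclusion regexes. Names and user-agent handling are assumptions.

interface RobotsRules {
  disallow: string[];
  allow: string[];
}

function parseRobotsTxt(text: string, userAgent = "*"): RobotsRules {
  const rules: RobotsRules = { disallow: [], allow: [] };
  let applies = false;

  for (const raw of text.split("\n")) {
    // Strip comments and surrounding whitespace.
    const line = raw.split("#")[0].trim();
    const sep = line.indexOf(":");
    if (sep < 0) continue;
    const key = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (key === "user-agent") {
      // Simplification: only honor one agent group (wildcard by default).
      applies = value === userAgent;
    } else if (applies && value) {
      if (key === "disallow" && value !== "/") {
        // Skip a blanket "Disallow: /" so it doesn't exclude the whole crawl.
        rules.disallow.push(value);
      } else if (key === "allow") {
        rules.allow.push(value);
      }
    }
  }
  return rules;
}

const escapeRegex = (s: string) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Turn a robots.txt path prefix into a scope regex anchored on the site origin.
// Robots wildcards ("*", "$") are escaped literally here - another simplification.
function pathToRegex(origin: string, path: string): RegExp {
  return new RegExp("^" + escapeRegex(origin.replace(/\/$/, "")) + escapeRegex(path));
}

// Example: map a couple of the gentoo.org rules above onto exclusion regexes.
const robots = parseRobotsTxt("User-agent: *\nDisallow: /search.php\nDisallow: /login.php\n");
const exclusions = robots.disallow.map((p) => pathToRegex("https://forums.gentoo.org", p));
console.log(exclusions.map(String)); // prints the two anchored exclusion patterns
```

The resulting patterns could then presumably be merged into the crawler's existing exclusion handling rather than applied ad hoc as in the example.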

This isn't a priority for us at the moment, but would welcome a PR that does this!

ikreymer avatar Jul 04 '24 20:07 ikreymer

Good points!

This is not a high priority for us either, let's hope we find time to work on it ^^

benoit74 avatar Jul 08 '24 05:07 benoit74

Thank you very much @ikreymer and @tw4l, looking forward to seeing this in action!

benoit74 avatar Nov 27 '25 07:11 benoit74

@benoit74 No problem! If you'd like, you can start testing it out with the 1.10.0-beta.0 release!

ikreymer avatar Nov 27 '25 15:11 ikreymer

Documenting for future reference - at this point, robots.txt support in Browsertrix is at the page level only. Pages that are disallowed by per-host robots.txts will be skipped rather than added to the crawl queue. We are not (yet) checking robots.txt for all page resources.
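As a rough sketch of that distinction (the names below are hypothetical, not the actual Browsertrix code): a candidate page URL is checked against robots.txt before being queued, while sub-resources loaded by pages are not checked at all:

```ts
// Hypothetical sketch of page-level robots.txt gating; isAllowedByRobots and
// queueUrl are illustrative names, not the actual Browsertrix implementation.
async function maybeQueuePage(
  url: string,
  isAllowedByRobots: (u: string) => Promise<boolean>,
  queueUrl: (u: string) => void,
): Promise<void> {
  if (await isAllowedByRobots(url)) {
    queueUrl(url); // allowed: the page joins the crawl queue
  }
  // disallowed: the page is skipped entirely; resources loaded by other pages
  // (images, scripts, etc.) are not checked against robots.txt at this point
}
```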

tw4l avatar Nov 27 '25 17:11 tw4l

Yes, I saw that. This is already a significant leap forward, at least from my perspective 😄

benoit74 avatar Nov 27 '25 21:11 benoit74