
Automatically add exclusion rules based on `robots.txt`

benoit74 opened this issue

It would be nice if the crawler could automatically fetch each site's robots.txt and add an exclusion rule for every rule present in that file.

I think this functionality should even be turned on by default, to avoid annoying servers which have clearly expressed what they do not want "external systems" to mess with.

At Kiwix, we have lots of non-tech users configuring zimit to do a browsertrix crawl. In most cases, they have no idea what a robots.txt is, so having the switch turned on by default would help a lot. That being said, I don't mind if it is off by default; we can do the magic to turn it on by default in zimit ^^

benoit74 avatar Jun 27 '24 07:06 benoit74

Despite its name, robots.txt's purpose is to prevent (well, just give directions to, actually) indexing robots from exploring resources. browsertrix-crawler is a technical bot, but it acts as a user, and certainly not as an indexing bot.

I don't see value in such a feature, but I can imagine there are scenarios where it can be useful. @benoit74, do you have one to share?

Without further information, I'd advise against having this (not yet existent) feature on by default, as it changes the crawler's behavior, while I think this project relies on explicit flags for that.

rgaudin avatar Jun 27 '24 09:06 rgaudin

The first use case is https://forums.gentoo.org/robots.txt, where the robots.txt content indicates fairly accurately what we should exclude from a crawl of the https://forums.gentoo.org/ website:

```
Disallow: /cgi-bin/
Disallow: /search.php
Disallow: /admin/
Disallow: /memberlist.php
Disallow: /groupcp.php
Disallow: /statistics.php
Disallow: /profile.php
Disallow: /privmsg.php
Disallow: /login.php
Disallow: /posting.php
```

The idea behind automatically using robots.txt is to help lazy / not-so-knowledgeable users get a first version of a WARC/ZIM which is likely to contain only useful content, rather than wasting time and resources (ours and the upstream server's) building a WARC/ZIM with too many unneeded pages.

Currently in self-service mode, users tend to simply input the URL https://forums.gentoo.org/ and say "Zimit!". And this is true for "young" Kiwix editors as well.

After that initial run, it might prove interesting in this case to still include /profile.php (user profiles) in the crawl. At the very least, such a choice probably needs to be discussed by humans. But this kind of refinement can be done in a second step, once we realize something is missing.

If we do not automate anything here, it means the self-service approach is mostly doomed to produce only bad archives, which is a bit sad.

benoit74 avatar Jun 27 '24 11:06 benoit74

This confirms that it can be useful in zimit, via an option (that you'd turn on).

rgaudin avatar Jun 27 '24 12:06 rgaudin

We're definitely aware of robots.txt and generally haven't used it, as it may be too restrictive for browser-based archiving. However, robots.txt may provide a hint for paths to exclude, as you suggest. The idea would be to gather all of the specific Disallow rules while ignoring something like `Disallow: /`. Of course, some of the robots rules are URL-specific, but they could also apply to in-page block rules as well.

An interesting idea - we could extend the sitemap support, which already parses robots.txt (https://github.com/webrecorder/browsertrix-crawler/blob/main/src/util/sitemapper.ts#L209), and simply parse all of the Disallow and Allow rules to create exclusions and inclusions. Not quite sure how to handle different user agents - perhaps grabbing rules from all of them, or from a specific one?
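For illustration, a minimal sketch of that idea: parsing Disallow/Allow rules into exclusion and inclusion patterns, assuming a single user-agent group and treating robots.txt wildcards as literals. The function names here are hypothetical, not part of browsertrix-crawler:

```ts
// Illustrative sketch only: turn robots.txt Disallow/Allow rules into
// exclusion/inclusion regexes. Names and user-agent handling are assumptions.

interface RobotsRules {
  disallow: string[];
  allow: string[];
}

function parseRobotsTxt(text: string, userAgent = "*"): RobotsRules {
  const rules: RobotsRules = { disallow: [], allow: [] };
  let applies = false;

  for (const raw of text.split("\n")) {
    // Strip comments and surrounding whitespace.
    const line = raw.split("#")[0].trim();
    const sep = line.indexOf(":");
    if (sep < 0) continue;
    const key = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (key === "user-agent") {
      // Simplification: only honor one agent group (wildcard by default).
      applies = value === userAgent;
    } else if (applies && value) {
      if (key === "disallow" && value !== "/") {
        // Skip a blanket "Disallow: /" so it doesn't exclude the whole crawl.
        rules.disallow.push(value);
      } else if (key === "allow") {
        rules.allow.push(value);
      }
    }
  }
  return rules;
}

const escapeRegex = (s: string) => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Turn a robots.txt path prefix into a scope regex anchored on the site origin.
// Robots wildcards ("*", "$") are escaped literally here - another simplification.
function pathToRegex(origin: string, path: string): RegExp {
  return new RegExp("^" + escapeRegex(origin.replace(/\/$/, "")) + escapeRegex(path));
}

// Example: map a couple of the gentoo.org rules above onto exclusion regexes.
const robots = parseRobotsTxt("User-agent: *\nDisallow: /search.php\nDisallow: /login.php\n");
const exclusions = robots.disallow.map((p) => pathToRegex("https://forums.gentoo.org", p));
console.log(exclusions.map(String)); // prints the two anchored exclusion patterns
```

The resulting patterns could then presumably be merged into the crawler's existing exclusion handling rather than applied ad hoc as in the example.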

This isn't a priority for us at the moment, but would welcome a PR that does this!

ikreymer avatar Jul 04 '24 20:07 ikreymer

Good points!

This is not a high priority for us either, let's hope we find time to work on it ^^

benoit74 avatar Jul 08 '24 05:07 benoit74

Thank you very much @ikreymer and @tw4l, looking forward to seeing this in action!

benoit74 avatar Nov 27 '25 07:11 benoit74

@benoit74 No problem! If you'd like, you can start testing it out with the 1.10.0-beta.0 release!

ikreymer avatar Nov 27 '25 15:11 ikreymer

Documenting for future reference - at this point, robots.txt support in Browsertrix is at the page level only. Pages that are disallowed by per-host robots.txts will be skipped rather than added to the crawl queue. We are not (yet) checking robots.txt for all page resources.
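As a rough sketch of that distinction (the names below are hypothetical, not the actual Browsertrix code): a candidate page URL is checked against robots.txt before being queued, while sub-resources loaded by pages are not checked at all:

```ts
// Hypothetical sketch of page-level robots.txt gating; isAllowedByRobots and
// queueUrl are illustrative names, not the actual Browsertrix implementation.
async function maybeQueuePage(
  url: string,
  isAllowedByRobots: (u: string) => Promise<boolean>,
  queueUrl: (u: string) => void,
): Promise<void> {
  if (await isAllowedByRobots(url)) {
    queueUrl(url); // allowed: the page joins the crawl queue
  }
  // disallowed: the page is skipped entirely; resources loaded by other pages
  // (images, scripts, etc.) are not checked against robots.txt at this point
}
```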

tw4l avatar Nov 27 '25 17:11 tw4l

Yes, I saw that. This is already a significant leap forward, at least from my perspective 😄

benoit74 avatar Nov 27 '25 21:11 benoit74