minoru-fediverse-crawler

Many instances have restrictive robots.txt

Open · Minoru opened this issue 4 years ago · 1 comment

I just implemented support for robots.txt (#4), and I'm seeing a drop in the number of "alive" instances. Apparently Pleroma used to ship a deny-all robots.txt, and these days it's configurable.
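For reference, a deny-all robots.txt is just two lines; Pleroma's old default was presumably along these lines:

```
User-agent: *
Disallow: /
```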

I'm happy that this code works, but I'm unhappy that it hurts the statistics this much.

I think I'll deploy this spider as-is, and then start a conversation about what should be done. An argument could be made that, since the spider only accesses a fixed number of well-known locations, it should be exempt from robots.txt. OTOH, it's a robot, so robots.txt clearly applies.
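For context, here is a minimal sketch of what such a check might look like in Rust. This is not the crawler's actual code from #4: the function name and the `"minoru-fediverse-crawler"` user-agent string are hypothetical, and the parser is a simplified approximation (prefix-only `Disallow` matching, no `Allow` precedence or wildcards), which is still enough to recognize a deny-all robots.txt blocking the NodeInfo endpoints.

```rust
/// Returns true if `path` is allowed for `user_agent` by `robots_body`.
/// Simplified sketch: prefix-only Disallow matching, no Allow/wildcard rules.
fn allowed_by_robots(robots_body: &str, user_agent: &str, path: &str) -> bool {
    let mut applies = false;     // does the current group apply to our user-agent?
    let mut in_ua_group = false; // are we still reading consecutive User-agent lines?
    let mut disallowed = false;

    for raw in robots_body.lines() {
        // Strip comments and surrounding whitespace.
        let line = raw.split('#').next().unwrap_or("").trim();
        if line.is_empty() {
            in_ua_group = false;
            continue;
        }
        let Some((field, value)) = line.split_once(':') else { continue };
        let field = field.trim().to_ascii_lowercase();
        let value = value.trim();
        match field.as_str() {
            "user-agent" => {
                let matches = value == "*"
                    || user_agent
                        .to_ascii_lowercase()
                        .contains(&value.to_ascii_lowercase());
                // Consecutive User-agent lines form one group, so OR them together.
                applies = if in_ua_group { applies || matches } else { matches };
                in_ua_group = true;
            }
            "disallow" if applies => {
                in_ua_group = false;
                // An empty Disallow value allows everything.
                if !value.is_empty() && path.starts_with(value) {
                    disallowed = true;
                }
            }
            _ => in_ua_group = false,
        }
    }
    !disallowed
}

fn main() {
    // The deny-all file blocks even the NodeInfo well-known location.
    let deny_all = "User-agent: *\nDisallow: /\n";
    assert!(!allowed_by_robots(
        deny_all,
        "minoru-fediverse-crawler",
        "/.well-known/nodeinfo"
    ));

    // An empty robots.txt permits it.
    assert!(allowed_by_robots("", "minoru-fediverse-crawler", "/.well-known/nodeinfo"));

    println!("robots.txt checks behave as expected");
}
```

This is exactly why the "alive" count dropped: a deny-all file forbids even the handful of well-known endpoints the crawler touches.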

Minoru avatar Nov 04 '21 20:11 Minoru

My logs indicate that 2477 nodes forbid access to their NodeInfo via robots.txt. That's a sizeable number, roughly 31% of the 7995 instances currently in my "alive" list.

Minoru avatar May 10 '22 09:05 Minoru