Many instances have restrictive robots.txt
I just implemented support for robots.txt (#4), and I'm seeing a drop in the number of "alive" instances. Apparently Pleroma used to ship a deny-all robots.txt, and these days it's configurable.
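For reference, a deny-all robots.txt is just:

```
User-agent: *
Disallow: /
```

The check itself boils down to something like the sketch below. This is a rough illustration using Python's stdlib `urllib.robotparser`, with a made-up user agent string and hostname, not the exact code from #4:

```python
from urllib.robotparser import RobotFileParser

def may_fetch_nodeinfo(host: str, user_agent: str = "fediverse-spider") -> bool:
    """Return True if the instance's robots.txt permits fetching its NodeInfo URL."""
    rp = RobotFileParser(f"https://{host}/robots.txt")
    rp.read()  # fetch and parse robots.txt; a missing file is treated as allow-all
    return rp.can_fetch(user_agent, f"https://{host}/.well-known/nodeinfo")

if __name__ == "__main__":
    # Hypothetical instance hostname, purely for illustration.
    print(may_fetch_nodeinfo("example.social"))
```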
I'm happy that this code works, but I'm unhappy that it hurts the statistics this much.
I think I'll deploy this spider as-is, and then start a conversation about what should be done. An argument could be made that, since the spider only accesses a fixed number of well-known locations, it should be exempt from robots.txt. OTOH, it's a robot, so robots.txt clearly applies to it.
My logs indicate that 2,477 nodes forbid access to their NodeInfo endpoint via robots.txt. That's a sizeable number, considering there are 7,995 instances in my "alive" list at the moment.