datasette
Block or rate limit based on User Agent?
I'm getting traffic from facebookexternalhit user agents -- it's not a huge amount (2 req/s), but the bill starts to add up. From what I can tell, this is the Facebook crawler, whose documentation does not mention robots.txt (unlike FacebookBot, which seems to respect it). This SO thread claims that the crawler doesn't respect robots.txt, so datasette-block-robots doesn't seem to solve this.
Is there another way to block or rate limit a given user agent in Datasette? I'm deploying on Google Cloud, if that's relevant. Thanks!
Actually -- it looks like it just took a few hours for the crawler to pull the latest robots.txt, so the original problem is fixed. I'm still curious about the original question, though, in case there are other crawlers that don't respect it.
I think this feature is out of scope for Datasette core, but it could make a great plugin. It would need to use the ASGI middleware plugin, then implement its own state tracking for rate limiting by IP.
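For the simpler blocking case, a minimal sketch of what such a plugin's middleware could look like -- a plain ASGI wrapper that returns 403 when the User-Agent matches a blocked substring. The `UserAgentBlocker` name and the `BLOCKED_AGENTS` list are illustrative assumptions, not an existing plugin:

```python
# Illustrative ASGI middleware: reject requests whose User-Agent
# contains any blocked substring. Names here are hypothetical.
BLOCKED_AGENTS = ("facebookexternalhit",)  # substrings to block (assumption)


class UserAgentBlocker:
    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] == "http":
            # ASGI headers are a list of (name, value) byte pairs
            headers = dict(scope.get("headers") or [])
            ua = headers.get(b"user-agent", b"").decode("latin-1").lower()
            if any(agent in ua for agent in BLOCKED_AGENTS):
                await send(
                    {
                        "type": "http.response.start",
                        "status": 403,
                        "headers": [(b"content-type", b"text/plain")],
                    }
                )
                await send({"type": "http.response.body", "body": b"Forbidden"})
                return
        # Anything else passes through to the wrapped application
        await self.app(scope, receive, send)


# In a Datasette plugin this could be wired up via the asgi_wrapper hook:
#
#   from datasette import hookimpl
#
#   @hookimpl
#   def asgi_wrapper(datasette):
#       return UserAgentBlocker
```

Since the class takes the downstream app as its only constructor argument, it can be returned directly from the `asgi_wrapper` hook.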
It could even use an in-memory SQLite database for the rate limit counters, which could be pretty neat.
Makes sense! Just for clarification, if I decide to take a stab at it -- do you mean this ASGI plugin? Why do you think it's better to use that vs using the asgi_wrapper plugin hook directly?