
Block or rate limit based on User Agent?

Open louispotok opened this issue 1 year ago • 1 comments

I'm getting traffic from facebookexternalhit user agents -- it's not a huge amount (2 req/s) but the bill starts to add up. From what I can tell, this is the Facebook crawler, whose documentation does not mention robots.txt (vs FacebookBot, which seems to respect it). This SO thread claims that the crawler doesn't respect robots.txt, so datasette-block-robots doesn't seem to solve this.

Is there another way to block or rate limit a given user agent in datasette? I'm deploying on Google Cloud if that's relevant. Thanks!

louispotok, Jun 28 '24 04:06

Actually - looks like it just took a few hours for it to pull the latest robots.txt, so the original problem is fixed. Still curious about the original question, in case there are other crawlers that don't respect it.

louispotok, Jun 28 '24 06:06

I think this feature is out-of-scope for Datasette core, but it could make a great plugin. It would need to use the ASGI middleware plugin, then implement its own state tracking for rate limiting against IP.
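As a rough illustration of the middleware idea, here is a minimal sketch of a plain ASGI wrapper that rejects requests by User-Agent before they reach the wrapped app. The names `block_user_agents` and `BLOCKED_AGENT_SUBSTRINGS` are hypothetical, and nothing here is an existing plugin. Presumably a plugin would return a wrapper like this from Datasette's `asgi_wrapper` hook, but that wiring is left out so the sketch runs on its own.

```python
# Hypothetical blocklist: lowercase substrings matched against the User-Agent header.
BLOCKED_AGENT_SUBSTRINGS = ("facebookexternalhit",)


def block_user_agents(app, blocked=BLOCKED_AGENT_SUBSTRINGS):
    """Wrap an ASGI app so that matching user agents receive a 403."""

    async def wrapped(scope, receive, send):
        if scope["type"] == "http":
            # ASGI headers are a list of (name, value) byte pairs with lowercase names.
            headers = dict(scope.get("headers") or [])
            agent = headers.get(b"user-agent", b"").decode("latin-1").lower()
            if any(fragment in agent for fragment in blocked):
                await send(
                    {
                        "type": "http.response.start",
                        "status": 403,
                        "headers": [(b"content-type", b"text/plain; charset=utf-8")],
                    }
                )
                await send({"type": "http.response.body", "body": b"Forbidden"})
                return
        # Everything else (including non-HTTP scopes) passes through untouched.
        await app(scope, receive, send)

    return wrapped
```

Matching on a lowercase substring is a deliberate simplification; the User-Agent header is trivially spoofable, so this only helps against crawlers that identify themselves honestly.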

It could even use an in-memory SQLite database for the rate limit counters, which could be pretty neat.
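One way the in-memory SQLite counters might look is a fixed-window limiter keyed by client IP. This is only a sketch under assumptions: the class name `SqliteRateLimiter`, the `rate_limit` table layout, and the default limits are all made up for illustration, and it assumes a single process using the connection from one thread.

```python
import sqlite3
import time


class SqliteRateLimiter:
    """Fixed-window request counter stored in an in-memory SQLite database."""

    def __init__(self, limit=60, window_seconds=60):
        self.limit = limit
        self.window_seconds = window_seconds
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "create table rate_limit ("
            " client text, bucket integer, hits integer,"
            " primary key (client, bucket))"
        )

    def allow(self, client, now=None):
        """Record one request for `client`; return False once over the limit."""
        now = time.time() if now is None else now
        bucket = int(now // self.window_seconds)
        with self.db:  # commit the three statements together
            # Drop counters from earlier windows so the table stays small.
            self.db.execute("delete from rate_limit where bucket < ?", (bucket,))
            self.db.execute(
                "insert or ignore into rate_limit (client, bucket, hits)"
                " values (?, ?, 0)",
                (client, bucket),
            )
            self.db.execute(
                "update rate_limit set hits = hits + 1"
                " where client = ? and bucket = ?",
                (client, bucket),
            )
        (hits,) = self.db.execute(
            "select hits from rate_limit where client = ? and bucket = ?",
            (client, bucket),
        ).fetchone()
        return hits <= self.limit
```

A middleware could call `allow()` with the client address from the ASGI scope and respond with a 429 when it returns False. Behind a proxy such as Google Cloud's load balancer the real client IP would likely need to come from a forwarded header instead, which this sketch does not handle.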

simonw, Jul 02 '24 05:07

Makes sense! Just for clarification if I decide to take a stab at it -- do you mean this ASGI plugin? Why do you think it's better to use that rather than the asgi_wrapper plugin hook directly?

louispotok, Jul 02 '24 06:07