crawler-user-agents

Disambiguate http clients from crawlers/bots

Open srstsavage opened this issue 1 year ago • 2 comments

I was surprised to find HTTP clients like python-requests, Go-http-client, wget, curl, etc. included in the crawler list. While I understand that these tools can be abused, in our case a large portion of our legitimate web traffic comes from API requests made with HTTP clients like these.

For now I think I'll need to create an overriding allow list of patterns and remove matches from agents.Crawlers before processing, but it would be great to be able to disambiguate client tools/libraries based on a field in crawler-user-agents.json. Maybe just an is_client boolean, or a more generic tags string array which could contain client or similar? Any thoughts?

srstsavage avatar Oct 04 '24 16:10 srstsavage

I'm sure I missed a few, but it looks like the list isn't too long:

aiohttp
Apache-HttpClient
^curl
Go-http-client
http_get
httpx
libwww-perl
node-fetch
okhttp
python-requests
Python-urllib
[wW]get

srstsavage avatar Oct 04 '24 16:10 srstsavage
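
The allow-list workaround described in the first comment could look roughly like the sketch below. It is only an illustration: `SAMPLE_ENTRIES` is a made-up stand-in for the parsed contents of crawler-user-agents.json, the helper names `without_clients` and `is_crawler` are hypothetical, and it assumes the client patterns are spelled in the file exactly as in the list above.

```python
import re

# Patterns from the list above, treated here as generic HTTP clients.
# Assumption: they match the "pattern" strings in the data file verbatim.
CLIENT_PATTERNS = {
    "aiohttp", "Apache-HttpClient", "^curl", "Go-http-client",
    "http_get", "httpx", "libwww-perl", "node-fetch", "okhttp",
    "python-requests", "Python-urllib", "[wW]get",
}

# Illustrative stand-in for entries loaded from crawler-user-agents.json.
SAMPLE_ENTRIES = [
    {"pattern": "Googlebot\\/"},
    {"pattern": "^curl"},
    {"pattern": "python-requests"},
]


def without_clients(entries, allow=CLIENT_PATTERNS):
    """Drop entries whose pattern string is on the allow list."""
    return [e for e in entries if e["pattern"] not in allow]


def is_crawler(user_agent, entries):
    """Return True if any remaining pattern matches the user agent."""
    return any(re.search(e["pattern"], user_agent) for e in entries)


crawlers = without_clients(SAMPLE_ENTRIES)
```

With this in place, `is_crawler("curl/8.4.0", crawlers)` no longer reports a crawler, while a Googlebot user agent still does. The drawback is the one raised in the issue: the allow list lives outside the data file and has to be maintained by hand.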

Completely see your point. I like the idea of having optional tags:

"tags": ["generic-client"]

Would you open a pull request? Thanks!

monperrus avatar Oct 07 '24 06:10 monperrus
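
If the optional `tags` field were adopted as proposed, consumers could filter on it instead of maintaining their own allow list. The following is a sketch under that assumption only: the field does not exist in the data as discussed in this thread, `TAGGED_ENTRIES` is invented sample data, and the helper names `has_tag`, `crawler_patterns`, and `is_crawler_tagged` are hypothetical.

```python
import re

# Invented sample data showing the proposed optional "tags" field.
TAGGED_ENTRIES = [
    {"pattern": "Googlebot\\/"},  # no "tags" key: the field is optional
    {"pattern": "^curl", "tags": ["generic-client"]},
    {"pattern": "python-requests", "tags": ["generic-client"]},
]


def has_tag(entry, tag):
    """True if the entry carries the tag; a missing "tags" key means no tags."""
    return tag in entry.get("tags", [])


def crawler_patterns(entries, exclude_tag="generic-client"):
    """Compile the patterns of all entries that do not carry exclude_tag."""
    return [
        re.compile(e["pattern"])
        for e in entries
        if not has_tag(e, exclude_tag)
    ]


def is_crawler_tagged(user_agent, entries):
    """Match the user agent against the non-client patterns only."""
    return any(p.search(user_agent) for p in crawler_patterns(entries))
```

Because entries without a `tags` key are treated as untagged, existing data would keep working unchanged, and a tag list leaves room for categories other than `generic-client` later.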