varnish-devicedetect
Extend bot rules
- simplify / use more generic bot rules
- add extra bots (ia_archiver, gtmetrix, lighthouse)
Bonjour @jocel1,
Since your change is doing two distinct things, I would rather see two commits. There's also no explanation or justification for why we should generalize certain rules. Not being a historical maintainer of this project, I can't tell why choices were made and whether it's a good idea to challenge them.
One thing you could do for example is share a list of user agents to add test coverage, to make sure we don't break previous expectations.
Hi @Dridi!
For the first one: (?i)(ads|google|bing|msn|yandex|baidu|ro|career|seznam|)bot is strictly equivalent to (?i)bot, since the alternation ends with an empty "|" branch.
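To illustrate the equivalence, here is a minimal sketch in Python rather than VCL. It assumes unanchored "search" semantics and relies on Python's `re` behaving like PCRE for patterns this simple. The sample user agents and the `same_verdict` helper are only illustrative, not part of the project's test suite.

```python
import re

# Pattern copied from the existing rule, next to the simplified form.
LONG = re.compile(r"(?i)(ads|google|bing|msn|yandex|baidu|ro|career|seznam|)bot")
SHORT = re.compile(r"(?i)bot")

# Illustrative user agents (sample values chosen for this sketch).
SAMPLES = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)",
    "SomeUnknownBot/1.0",
    "Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0",
]


def same_verdict(ua):
    """Return True when both patterns agree on whether `ua` matches."""
    return bool(LONG.search(ua)) == bool(SHORT.search(ua))


for ua in SAMPLES:
    print(same_verdict(ua), ua)
```

Because the last branch of the group is empty, the group can always match nothing, so any string containing "bot" matches the long form as well.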
The main reason to add "google" is to cover the Google AdSense user agent: Mediapartners-Google. I also checked that Google Pixel devices don't have "google" in their user agent, but we could perhaps add just Mediapartners-Google instead.
For spider, I often discover new bots like Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; https://zhanzhang.toutiao.com/), Mozilla/5.0 (compatible; seoscanners.net/1; [email protected]) or CheckMarkNetwork/1.0 (+http://www.checkmarknetwork.com/spider.html), so having a generic "spider" was easier, and seems to be safe like "bot".
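As a rough check, again in Python rather than VCL, the sketch below runs a generic case-insensitive "spider" pattern against two of the user agents quoted above. The pattern `(?i)spider` is my assumption of what the generic rule would look like, and `is_spider` is a hypothetical helper. Neither of these two user agents contains "bot", so the generic "bot" rule alone would miss them.

```python
import re

# Assumed form of the generic rule; the real rule lives in the VCL file.
SPIDER = re.compile(r"(?i)spider")
BOT = re.compile(r"(?i)bot")

# User agents quoted in this comment.
UAS = [
    "Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Mobile Safari/537.36 (compatible; Bytespider; https://zhanzhang.toutiao.com/)",
    "CheckMarkNetwork/1.0 (+http://www.checkmarknetwork.com/spider.html)",
]


def is_spider(ua):
    """Return True when the generic spider pattern matches `ua`."""
    return SPIDER.search(ua) is not None


for ua in UAS:
    # Show the verdict of each rule side by side.
    print("spider:", is_spider(ua), "bot:", BOT.search(ua) is not None)
```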
ia_archiver is a common bot https://user-agents.net/string/ia-archiver
I also changed facebook to match
user-agent: facebookcatalog/1.0
For the last one: with (?i)(web)crawler, the syntax suggests that (?i)(web)?crawler was intended, to match for example:
user-agent: Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)
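A small Python sketch of the difference between the two patterns on that user agent (Python's `re` stands in for the regex engine used by Varnish here, which should not matter for patterns this simple; the variable names are only illustrative):

```python
import re

# Current rule: the group is mandatory, so only "webcrawler" can match.
STRICT = re.compile(r"(?i)(web)crawler")
# Proposed rule: the group is optional, so a bare "crawler" matches too.
OPTIONAL = re.compile(r"(?i)(web)?crawler")

UA = "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"

strict_hit = STRICT.search(UA)
optional_hit = OPTIONAL.search(UA)

print("strict:", strict_hit is not None)
print("optional:", optional_hit.group(0) if optional_hit else None)
```

The user agent contains "webmeup-crawler" but never the contiguous string "webcrawler", which is why only the optional form matches.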
For gtmetrix / lighthouse, I don't know whether we should treat them as bots or not. Perhaps we could create a new category for them, like "synthetic-bot"? (We could also add "Synthetic" to that category to match Dynatrace.)