statify
Optimize bot detection
At the moment we use a few substrings to detect crawlers from the user agent string:
https://github.com/pluginkollektiv/statify/blob/667518428b30b0522367fb2c955d1913e1ef672f/inc/class-statify-frontend.php#L222-L236
We could add some more strings, like `seo`, `crawling` and `chrome-lighthouse` (borrowed from Koko Analytics):
https://github.com/ibericode/koko-analytics/blob/18716dc9156a83e72b2967cec6dee8ce9acfdbe9/assets/src/js/script.js#L53
Looking at the 10 biggest crawlers, I think we catch almost all of them. But Alexa, with their `ia_archiver`, is missing.
Maybe something like `fetcher` and `scraper` too ...
`facebookexternalhit` could be another candidate.
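To get a feel for what the proposed strings would catch, here is a minimal sketch in Python (not the plugin's PHP code). It assumes a plain case-insensitive substring check; the `CANDIDATES` tuple and the `matched_candidates` helper are hypothetical names, and the list contains only the strings suggested in this thread, not Statify's current list.

```python
# Candidate substrings proposed in this thread (assumption: not Statify's
# existing list, which lives in class-statify-frontend.php).
CANDIDATES = (
    "seo",
    "crawling",
    "chrome-lighthouse",
    "fetcher",
    "scraper",
    "facebookexternalhit",
)


def matched_candidates(user_agent):
    """Return the candidate substrings found in the user agent (case-insensitive)."""
    ua = user_agent.lower()
    return [candidate for candidate in CANDIDATES if candidate in ua]


# Sample user agent strings for illustration.
facebook_ua = "facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)"
lighthouse_ua = (
    "Mozilla/5.0 (Linux; Android 7.0; Moto G (4)) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/98.0.4695.0 Mobile Safari/537.36 Chrome-Lighthouse"
)
firefox_ua = "Mozilla/5.0 (X11; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0"

for ua in (facebook_ua, lighthouse_ua, firefox_ua):
    print(matched_candidates(ua))
```

Short strings such as `seo` can also appear inside unrelated words, so each candidate probably deserves a check against real visitor user agents before it is added.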
Or we could take the big step and use a third-party library like https://github.com/JayBizzle/Crawler-Detect to detect crawlers.
Another one (in JSON) would be https://github.com/monperrus/crawler-user-agents/blob/master/crawler-user-agents.json
Additionally, we could make these identifiers filterable, so advanced users could extend the list based on their own usage and experience.
Stumbled upon another short variant for bots:
lighthouse|bot|google|baidu|bing|msn|duckduckbot|teoma|slurp|yandex|crawler|spider|robot|crawling
`bot` would already include `duckduckbot` and `robot`. And you could collapse `crawler` and `crawling` into `crawl`.
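The redundancy pointed out here can be checked mechanically. The following Python sketch (again not the plugin's PHP code) assumes the pattern is used as an unanchored, case-insensitive alternation, so any alternative that contains another alternative as a substring can never change the result. The helper names `drop_redundant` and `is_bot` are hypothetical.

```python
import re

# The short pattern quoted above, split into its alternatives.
PATTERN = (
    "lighthouse|bot|google|baidu|bing|msn|duckduckbot|teoma|slurp|"
    "yandex|crawler|spider|robot|crawling"
)


def drop_redundant(alternatives):
    """Drop every alternative that contains another, different alternative."""
    return [
        a for a in alternatives
        if not any(b != a and b in a for b in alternatives)
    ]


def is_bot(user_agent, alternatives):
    """Unanchored, case-insensitive match against any of the alternatives."""
    regex = re.compile("|".join(re.escape(a) for a in alternatives), re.IGNORECASE)
    return regex.search(user_agent) is not None


original = PATTERN.split("|")
reduced = drop_redundant(original)

# Alternatives that were subsumed by a shorter one.
print(sorted(set(original) - set(reduced)))
```

Merging `crawler` and `crawling` into `crawl` is not done automatically here, because `crawl` is not in the list. It would be a manual choice that slightly widens the match.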
Yes, I know. These are findings on the web. `lighthouse` would also match `chrome-lighthouse`, etc.
This needs consolidation and a decision. I am just sharing other projects' solutions to bot detection.
For example: Alexa will be closed on 1st May 2022 (https://support.alexa.com/hc/en-us/articles/4410503838999), so `ia_archiver` no longer seems relevant and does not need to be added.
More user agent strings sorted by software name: https://developers.whatismybrowser.com/useragents/explore/software_name/
Using a Composer package like https://github.com/JayBizzle/Crawler-Detect sounds like a good idea to me. What do the others think? @2ndkauboy @krafit @pfefferle @stklcode
Implemented Composer autoloading for JayBizzle/Crawler-Detect and replaced the bot detection in `class-statify-frontend.php` with the CrawlerDetect function.
Pull request: #247
Added to the 2.0.0 milestone because the Composer package needs PHP 5.3 and we are currently on PHP 5.2.