statify
Optimize bot detection
At the moment we use a handful of substrings to detect crawlers from the user agent string:
https://github.com/pluginkollektiv/statify/blob/667518428b30b0522367fb2c955d1913e1ef672f/inc/class-statify-frontend.php#L222-L236
We could add some more strings, like seo, crawling and chrome-lighthouse
(borrowed from Koko Analytics):
https://github.com/ibericode/koko-analytics/blob/18716dc9156a83e72b2967cec6dee8ce9acfdbe9/assets/src/js/script.js#L53
Looking at the 10 biggest crawlers, I think we catch almost all of them. Only Alexa, with its ia_archiver, is missing.
Maybe something like fetcher and scraper too ...
facebookexternalhit could be another candidate.
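The substring approach discussed so far can be sketched roughly as follows. This is a language-neutral illustration in Python, not the plugin's PHP code; the function name `is_bot` and the exact keyword list are assumptions assembled from the candidates named in this thread.

```python
# Hypothetical keyword list built from the candidates mentioned in this thread.
BOT_KEYWORDS = (
    "bot", "crawl", "spider", "seo", "lighthouse",
    "fetcher", "scraper", "facebookexternalhit",
)


def is_bot(user_agent: str) -> bool:
    """Return True if any keyword occurs in the lowercased user agent."""
    if not user_agent:
        # Treating an empty user agent as a bot is an assumption of this sketch.
        return True
    ua = user_agent.lower()
    return any(keyword in ua for keyword in BOT_KEYWORDS)
```

Short keywords such as seo are cheap to check but can also match unrelated user agents, which is one argument for keeping the list small and reviewed.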
Or we could take the big step and use a third-party library like https://github.com/JayBizzle/Crawler-Detect to detect crawlers.
Another one (in JSON) would be https://github.com/monperrus/crawler-user-agents/blob/master/crawler-user-agents.json
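A JSON list like this could be consumed roughly as sketched below (Python used for illustration only). The sketch assumes each entry stores a regular expression under a `pattern` key; the two inline entries are illustrative sample data, not a copy of the real file, and the helper names `load_patterns` and `matches_any` are hypothetical.

```python
import json
import re

# Tiny inline sample in the assumed shape of the JSON list;
# the real file is much longer and these entries are illustrative only.
SAMPLE_JSON = r'''
[
  {"pattern": "Googlebot\\/"},
  {"pattern": "bingbot"}
]
'''


def load_patterns(raw_json: str) -> list:
    """Compile the regular expression stored under each entry's "pattern" key."""
    return [re.compile(entry["pattern"]) for entry in json.loads(raw_json)]


def matches_any(user_agent: str, patterns) -> bool:
    """Return True if at least one compiled pattern is found in the user agent."""
    return any(p.search(user_agent) for p in patterns)
```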
Additionally, we could make these identifiers filterable. Advanced users could then extend the list based on their own usage and experience.
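To illustrate what "filterable" could mean, here is a minimal sketch in Python. In the plugin itself this would presumably go through WordPress's filter mechanism (`apply_filters`) instead; the names `add_keyword_filter` and `get_bot_keywords` and the default list are hypothetical.

```python
from typing import Callable, List

# Hypothetical defaults; the real list is whatever the plugin ships with.
DEFAULT_KEYWORDS = ["bot", "crawl", "spider"]

# Registered callbacks; each receives the current list and returns a new one.
_keyword_filters: List[Callable[[List[str]], List[str]]] = []


def add_keyword_filter(callback: Callable[[List[str]], List[str]]) -> None:
    """Register a callback that may extend or reduce the keyword list."""
    _keyword_filters.append(callback)


def get_bot_keywords() -> List[str]:
    """Return the defaults after running them through all registered callbacks."""
    keywords = list(DEFAULT_KEYWORDS)  # copy, so the defaults are never mutated
    for callback in _keyword_filters:
        keywords = callback(keywords)
    return keywords
```

A site owner could then call `add_keyword_filter(lambda kws: kws + ["scraper"])` to add their own identifier without touching the defaults.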
Stumbled upon another short variant for bots:
lighthouse|bot|google|baidu|bing|msn|duckduckbot|teoma|slurp|yandex|crawler|spider|robot|crawling
bot would already cover duckduckbot and robot. And you could collapse crawler and crawling into crawl.
Yes, I know. These are findings on the web. lighthouse would also match chrome-lighthouse, etc.
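The redundancy in that pattern can be checked with a quick sketch (Python, for illustration only). The consolidated pattern and the sample user agents are assumptions of this example. Note that crawl is slightly broader than crawler|crawling, so the shorter pattern matches everything the original does, plus a few more strings.

```python
import re

# The pattern as quoted in this thread.
ORIGINAL = re.compile(
    r"lighthouse|bot|google|baidu|bing|msn|duckduckbot|teoma|slurp"
    r"|yandex|crawler|spider|robot|crawling",
    re.IGNORECASE,
)

# Same pattern with the alternatives already covered by "bot" and "crawl" removed.
CONSOLIDATED = re.compile(
    r"lighthouse|bot|google|baidu|bing|msn|teoma|slurp|yandex|crawl|spider",
    re.IGNORECASE,
)

# Illustrative sample user agents, including one regular browser.
SAMPLES = [
    "DuckDuckBot/1.0; (+http://duckduckgo.com/duckduckbot.html)",
    "Chrome-Lighthouse",
    "SomeRobot/1.0",
    "MyCrawler/2.0",
    "crawling-agent",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]

# Both patterns should agree on every sample.
results = [
    (bool(ORIGINAL.search(ua)), bool(CONSOLIDATED.search(ua))) for ua in SAMPLES
]
```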
This needs consolidation and a decision. I am just sharing other projects' solutions to bot detection.
For example: Alexa will be closed on 1st May 2022 (https://support.alexa.com/hc/en-us/articles/4410503838999), so ia_archiver no longer seems relevant and does not need to be added.
More user agent strings sorted by software name: https://developers.whatismybrowser.com/useragents/explore/software_name/
Using a Composer package like https://github.com/JayBizzle/Crawler-Detect sounds like a good idea to me. What do the others think? @2ndkauboy @krafit @pfefferle @stklcode
Implemented Composer autoloading for JayBizzle/Crawler-Detect and replaced the bot detection in class-statify-frontend.php with CrawlerDetect.
Pull request: #247
Added to the 2.0.0 milestone because the composer package needs PHP 5.3 and we are on PHP 5.2 currently.