
Optimize bot detection

Open Zodiac1978 opened this issue 3 years ago • 6 comments

At the moment we use a few substrings to detect crawlers from the user agent string:

https://github.com/pluginkollektiv/statify/blob/667518428b30b0522367fb2c955d1913e1ef672f/inc/class-statify-frontend.php#L222-L236

We could add some more strings, like seo, crawling and chrome-lighthouse (borrowed from Koko Analytics):

https://github.com/ibericode/koko-analytics/blob/18716dc9156a83e72b2967cec6dee8ce9acfdbe9/assets/src/js/script.js#L53

Looking at the 10 biggest crawlers, I think we catch almost all of them. But Alexa's ia_archiver is missing.

Maybe something like fetcher and scraper too ...

facebookexternalhit could be another candidate.
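To make the proposal concrete, here is a minimal sketch of substring matching with the candidate strings named above. It is written in Python purely for illustration and is not Statify's actual code: `EXTRA_TOKENS` and `matches_extra_token` are hypothetical names, and the existing list from class-statify-frontend.php is not reproduced here.

```python
# Candidate strings proposed in this issue (hypothetical list name).
EXTRA_TOKENS = (
    "seo",
    "crawling",
    "chrome-lighthouse",
    "ia_archiver",
    "fetcher",
    "scraper",
    "facebookexternalhit",
)


def matches_extra_token(user_agent):
    """Return True if the user agent contains any candidate string.

    The comparison is case-insensitive; a missing user agent is treated
    as an empty string and therefore matches nothing.
    """
    ua = (user_agent or "").lower()
    return any(token in ua for token in EXTRA_TOKENS)
```

Plain substring checks like this are cheap, but short tokens such as seo can also match unrelated user agents, so each addition would need a quick check against real traffic.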

Or we could take the big step and use a third-party library like https://github.com/JayBizzle/Crawler-Detect to detect crawlers.

Another one (in JSON) would be https://github.com/monperrus/crawler-user-agents/blob/master/crawler-user-agents.json
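A rough sketch of how a JSON list like this could be consumed, again in Python for illustration only. It assumes each entry is an object with a `pattern` field holding a regular expression, which is my understanding of that file's layout; the three sample entries below are made up in that shape and are not copied from the file. `compile_patterns` and `is_crawler` are hypothetical names.

```python
import json
import re

# Made-up sample in the assumed shape of the JSON file.
SAMPLE_JSON = """
[
  {"pattern": "Googlebot"},
  {"pattern": "facebookexternalhit"},
  {"pattern": "ia_archiver"}
]
"""


def compile_patterns(json_text):
    """Combine all `pattern` entries into one compiled alternation."""
    entries = json.loads(json_text)
    combined = "|".join("(?:%s)" % entry["pattern"] for entry in entries)
    return re.compile(combined)


CRAWLER_RE = compile_patterns(SAMPLE_JSON)


def is_crawler(user_agent):
    """Return True if any pattern matches (case-sensitive in this sketch)."""
    return CRAWLER_RE.search(user_agent or "") is not None
```

Compiling the patterns once and reusing the result keeps the per-request cost to a single regex search.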

Zodiac1978 avatar Jul 20 '21 11:07 Zodiac1978

Additionally, we could make these identifiers filterable. Advanced users could then extend the list based on their own usage and experience.
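In the plugin itself this would presumably go through WordPress's filter mechanism. The Python sketch below only illustrates the idea of a default list that user callbacks can extend; every name in it (`DEFAULT_TOKENS`, `add_token_filter`, `get_tokens`, `is_bot`) and the default tokens themselves are hypothetical.

```python
# Hypothetical default identifiers; not Statify's real list.
DEFAULT_TOKENS = ["bot", "crawl", "spider"]

# Callbacks registered by users; each receives and returns a token list.
_token_filters = []


def add_token_filter(callback):
    """Register a callback that may extend or reduce the token list."""
    _token_filters.append(callback)


def get_tokens():
    """Return the default tokens after passing them through all callbacks."""
    tokens = list(DEFAULT_TOKENS)
    for callback in _token_filters:
        tokens = callback(tokens)
    return tokens


def is_bot(user_agent):
    """Case-insensitive substring check against the filtered token list."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in get_tokens())


# Example: a site owner adds an identifier seen in their own logs.
add_token_filter(lambda tokens: tokens + ["monitoring-agent"])
```

With the example callback registered, a user agent such as internal-monitoring-agent/1.0 is treated as a bot, while the defaults stay untouched for everyone else.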

Zodiac1978 avatar Jan 08 '22 12:01 Zodiac1978

Stumbled upon another short variant for bots: lighthouse|bot|google|baidu|bing|msn|duckduckbot|teoma|slurp|yandex|crawler|spider|robot|crawling

Zodiac1978 avatar Feb 22 '22 12:02 Zodiac1978

bot would already match duckduckbot and robot. And you could collapse crawler and crawling into crawl.
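This overlap can be checked mechanically. The sketch below (Python, illustration only; `minimal_tokens` is a hypothetical helper) drops every token that already contains a shorter token from the same list, since a substring match on the shorter one covers it.

```python
def minimal_tokens(tokens):
    """Drop tokens that contain another, shorter token as a substring."""
    # Shortest first, so a covering token is always seen before the
    # longer tokens it makes redundant; ties are ordered alphabetically.
    ordered = sorted(set(tokens), key=lambda t: (len(t), t))
    kept = []
    for token in ordered:
        if not any(shorter in token for shorter in kept):
            kept.append(token)
    return kept


# The variant quoted in this thread, split on "|".
FOUND = (
    "lighthouse|bot|google|baidu|bing|msn|duckduckbot|teoma|slurp|"
    "yandex|crawler|spider|robot|crawling"
).split("|")

# crawl is not in the quoted list, so it is added by hand here.
REDUCED = minimal_tokens(FOUND + ["crawl"])
```

With crawl added, the reduced list no longer contains duckduckbot, robot, crawler or crawling; the other tokens are kept.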

MatzeKitt avatar Feb 22 '22 12:02 MatzeKitt

Yes, I know. These are just findings from the web. lighthouse would also match chrome-lighthouse, etc.

This needs consolidation and a decision. I am just sharing other projects' solutions to bot detection.

Zodiac1978 avatar Feb 22 '22 12:02 Zodiac1978

For example: Alexa will be shut down on 1st May 2022 (https://support.alexa.com/hc/en-us/articles/4410503838999), so ia_archiver no longer seems relevant and does not need to be added.

Zodiac1978 avatar Feb 22 '22 13:02 Zodiac1978

More user agent strings sorted by software name: https://developers.whatismybrowser.com/useragents/explore/software_name/

Zodiac1978 avatar Feb 22 '22 13:02 Zodiac1978

Using a Composer package like https://github.com/JayBizzle/Crawler-Detect sounds like a good idea to me. What do the others think? @2ndkauboy @krafit @pfefferle @stklcode

florianbrinkmann avatar Mar 19 '23 08:03 florianbrinkmann

Implemented Composer autoloading for JayBizzle/Crawler-Detect and replaced the bot detection in class-statify-frontend.php with CrawlerDetect.

Pull request: #247

00Sleepy avatar Mar 19 '23 10:03 00Sleepy

Added to the 2.0.0 milestone because the Composer package needs PHP 5.3 and we are currently on PHP 5.2.

florianbrinkmann avatar Mar 19 '23 12:03 florianbrinkmann