Add an option to classify unknown OS and browsers as crawlers
To detect non-human visitors more accurately, an option to classify unknown OS and browsers as crawlers would help. In my case, 20% of visitors have an unknown OS and 12% have an unknown browser. Here is the top of the list from last month:
509 python-requests/2.25.1
534 googlebot
563 Mozilla/5.0 (compatible; CensysInspect/1.1; https://about.censys.io/)
567 WordPress/5.1.5; https://takefive.cn
573 http.rb/4.4.1
612 MB-Web-Crawler
624 GoodBot
624 lanaibot please contact [email protected] for information
709 Mozilla/5.0 (compatible; SEOkicks; https://www.seokicks.de/robot.html)
762 https://github.com/blakeembrey/popsicle
768 Mozilla/5.0 (compatible; Neevabot/1.0; https://neeva.com/neevabot)
891 MauiBot (crawler.feedback [email protected])
933 Buck/2.2; ( https://app.hypefactors.com/media-monitoring/about.html)
1119 Scrapy/2.5.0 ( https://scrapy.org)
1341 colly - https://github.com/gocolly/colly
1734 omgili/0.5 http://omgili.com
2766 unirest-java/1.3.11
4341 got (https://github.com/sindresorhus/got)
5352 Go 1.1 package http
7143 newspaper/0.2.8
10197 -
It would be possible to detect more: I still see more users with IE9 than with IE11, which should not be possible since I require TLS 1.2.
The option could be extended with an optional argument to ignore more
OS and browsers (like "Others" and "Feeds"), so I am unsure whether the
name is the right one. At first, I wanted to extend
--ignore-crawlers with an optional argument, but --crawlers-only
makes that unsuitable. Alternatively, we could have
--also-crawlers=Unknown,Others,Feeds.
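
To make the proposed semantics concrete, here is a rough Python sketch of how an --also-crawlers style list could feed the crawler decision. It is only an illustration, not the code in this PR: the function names, the "Crawlers" category label, and the choice to treat a match on either the browser or the OS category as sufficient are all assumptions.

```python
# Hypothetical sketch of the proposed behaviour; not the actual implementation.

# Assumption: with the option enabled and no argument, only "Unknown" is added.
DEFAULT_ALSO_CRAWLERS = frozenset({"Unknown"})


def parse_also_crawlers(value):
    """Parse a comma-separated list such as "Unknown,Others,Feeds"."""
    return frozenset(part.strip() for part in value.split(",") if part.strip())


def is_crawler(browser_type, os_type, also_crawlers=DEFAULT_ALSO_CRAWLERS):
    """Return True when a visitor should be counted as a crawler.

    browser_type and os_type are the category labels assigned by the
    user-agent parser (assumed here to be plain strings).
    """
    # Visitors already detected as crawlers stay crawlers.
    if browser_type == "Crawlers":
        return True
    # Assumption: a match on either category is enough.
    return browser_type in also_crawlers or os_type in also_crawlers
```

With the default set, a visitor with an unknown browser or OS is counted as a crawler, while "Feeds" is only affected when it is listed explicitly, e.g. `is_crawler("Feeds", "Linux", parse_also_crawlers("Unknown,Feeds"))`.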
That would be a step in the wrong direction in my opinion, as most unknown clients are not crawlers, unless you assume "cURL == crawler", which is also wrong. Your PR would make the detection less correct.
As you can see in the excerpt of unknown OS/browsers above, I don't have curl in the top positions, but I do have a lot of crawlers. Currently, they are misclassified.
Merged. Thanks for submitting this PR @vincentbernat. I like the idea of having this as an option. I'll be testing this out on my end and see what results I get.