Add an option to classify unknown OS and browsers as crawlers

Open vincentbernat opened this issue 4 years ago • 2 comments

To detect non-humans more accurately, an option to classify unknown OS and browsers as crawlers would help. In my case, 20% of visitors have an unknown OS and 12% have an unknown browser. Here are the top user agents among them from last month:

    509 python-requests/2.25.1
    534 googlebot
    563 Mozilla/5.0 (compatible; CensysInspect/1.1;  https://about.censys.io/)
    567 WordPress/5.1.5; https://takefive.cn
    573 http.rb/4.4.1
    612 MB-Web-Crawler
    624 GoodBot
    624 lanaibot please contact [email protected] for information
    709 Mozilla/5.0 (compatible; SEOkicks;  https://www.seokicks.de/robot.html)
    762 https://github.com/blakeembrey/popsicle
    768 Mozilla/5.0 (compatible; Neevabot/1.0;  https://neeva.com/neevabot)
    891 MauiBot (crawler.feedback [email protected])
    933 Buck/2.2; ( https://app.hypefactors.com/media-monitoring/about.html)
   1119 Scrapy/2.5.0 ( https://scrapy.org)
   1341 colly - https://github.com/gocolly/colly
   1734 omgili/0.5  http://omgili.com
   2766 unirest-java/1.3.11
   4341 got (https://github.com/sindresorhus/got)
   5352 Go 1.1 package http
   7143 newspaper/0.2.8
  10197 -

It would be possible to detect even more: I still see more users with IE9 than with IE11, which should not be possible since I require TLS 1.2.

The option could be extended with an optional argument to ignore more OS and browser categories (like "Others" and "Feeds"), so I am unsure whether the name is the right one. At first, I wanted to extend --ignore-crawlers with an optional argument, but --crawlers-only makes that unsuitable. Alternatively, we could have --also-crawlers=Unknown,Others,Feeds.
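To make the --also-crawlers=Unknown,Others,Feeds variant concrete, here is a minimal Python sketch of the classification rule it implies. It is not GoAccess code: the function names `parse_also_crawlers` and `is_crawler` are hypothetical, and it assumes a user-agent parser has already assigned each hit a browser category and an OS category such as "Crawlers", "Unknown", "Others" or "Feeds".

```python
def parse_also_crawlers(value):
    """Split a comma-separated category list into a normalized set.

    `value` mimics the argument of the proposed (hypothetical)
    --also-crawlers option, e.g. "Unknown,Others,Feeds".
    """
    return {part.strip().lower() for part in value.split(",") if part.strip()}


def is_crawler(browser_type, os_type, also_crawlers):
    """Decide whether a hit should be counted as a crawler.

    `browser_type` and `os_type` are category labels assumed to come
    from an existing user-agent parser; `also_crawlers` is the set
    returned by parse_also_crawlers().
    """
    # Hits already recognized as crawlers stay crawlers.
    if browser_type.lower() == "crawlers":
        return True
    # Otherwise, reclassify when either category is in the extra set.
    return browser_type.lower() in also_crawlers or os_type.lower() in also_crawlers


extra = parse_also_crawlers("Unknown,Others,Feeds")
print(is_crawler("Unknown", "Linux", extra))  # unknown browser
print(is_crawler("Chrome", "Linux", extra))   # recognized browser and OS
```

With an empty set, the rule falls back to the existing behaviour, so the extra categories stay strictly opt-in.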

vincentbernat, May 29 '21 10:05

That would be a step in the wrong direction in my opinion, as most unknown clients are not crawlers, unless you assume "cURL == crawler", which is also wrong. Your PR would make the detection less correct.

dertuxmalwieder, Jun 08 '21 19:06

As you can see in the excerpt of unknown OS/browsers above, curl is not among the top entries, but a lot of crawlers are. Currently, they are misclassified.

vincentbernat, Jun 08 '21 21:06

Merged. Thanks for submitting this PR @vincentbernat. I like the idea of having this as an option. I'll be testing this out on my end and see what results I get.

allinurl, Dec 02 '22 00:12