apache-ultimate-bad-bot-blocker

What is good What is not

Open ZerooCool opened this issue 6 years ago • 11 comments

What is good, and what is not?

Is Sogou really a "bad bot"?

In my log: visionduweb.fr:80 220.181.124.85 - - [16/Aug/2019:08:21:44 +0200] "GET /robots.txt HTTP/1.1" 301 473 "-" "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"

So I commented out the rule for Sogou: #BrowserMatchNoCase "^(.*?)(\bcrawl.sogou.com\b)(.*)$" bad_bot

Am I wrong?

ZerooCool avatar Aug 16 '19 08:08 ZerooCool

What version of the blocker?

I have tested on 2.2 and 2.4 and it is blocked.

Mitchells-MacBook-Pro:GIT mitchellkrog$ curl -A "Sogou web spider/4.0" -I https://mydomain
HTTP/1.1 403 Forbidden
Date: Fri, 16 Aug 2019 10:03:36 GMT
Server: Apache
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Content-Type: text/html; charset=iso-8859-1

mitchellkrogza avatar Aug 16 '19 10:08 mitchellkrogza

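I am on a recent version. Yes, it is blocked on my side too; that is how I noticed it in my logs. It comes from a Chinese search engine. I wanted to know why it should be blocked: why is this a bad bot, if its role is to add a site to the search engine https://www.sogou.com?

Sogou Search is the world's third-generation interactive search engine. It supports WeChat official account and article search, Zhihu search, English search and translation, and more, and provides users with professional, accurate, and convenient search services through independently developed artificial intelligence algorithms.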

ZerooCool avatar Aug 16 '19 10:08 ZerooCool

You can unblock it yourself. Add these lines to blacklist-user-agents.conf and reload Apache

BrowserMatchNoCase "^(.*?)(\bsogouspider\b)(.*)$" good_bot
BrowserMatchNoCase "^(.*?)(\bSogou\ web\ spider\b)(.*)$" good_bot
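
To see which of the patterns in this thread actually matches the user agent from the log in the first post, here is a minimal sketch. It uses Python's `re` module as a stand-in for Apache's regex matching (an assumption: the two engines agree on patterns this simple), with `re.IGNORECASE` approximating `BrowserMatchNoCase`. The helper name `matches` and the dictionary keys are hypothetical, chosen only for this illustration.

```python
import re

# User agent copied from the access log in the first post.
LOGGED_UA = "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)"

# Patterns quoted in this thread. The Apache config writes a space as "\ ";
# Python's re also reads "\ " as a literal space.
PATTERNS = {
    "crawl.sogou.com": r"^(.*?)(\bcrawl.sogou.com\b)(.*)$",
    "sogouspider": r"^(.*?)(\bsogouspider\b)(.*)$",
    "Sogou web spider": r"^(.*?)(\bSogou\ web\ spider\b)(.*)$",
}


def matches(pattern: str, user_agent: str) -> bool:
    """Case-insensitive search, approximating BrowserMatchNoCase."""
    return re.search(pattern, user_agent, re.IGNORECASE) is not None


for name, pattern in PATTERNS.items():
    print(f"{name}: {matches(pattern, LOGGED_UA)}")
# crawl.sogou.com: False
# sogouspider: False
# Sogou web spider: True
```

Under this approximation, only the "Sogou web spider" pattern matches the logged user agent, which would explain why commenting out the `crawl.sogou.com` rule alone did not stop the 403 responses.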

mitchellkrogza avatar Aug 16 '19 10:08 mitchellkrogza

Let me know if you don't succeed

mitchellkrogza avatar Aug 16 '19 13:08 mitchellkrogza

What I really mean to ask is: does this bot have to be considered a "bad bot"? If its goal is to get content indexed in Chinese search engines, what criterion puts it in the bad bot list?

Too high a visit frequency? A choice not to target China for SEO? Something else?

ZerooCool avatar Aug 16 '19 16:08 ZerooCool

It's nothing against China. I am based in South Africa and don't want my content on Sogou, which is why it was blocked from the beginning. But from the beginning I also built in ways for users to override and whitelist anything I blacklisted.

mitchellkrogza avatar Aug 16 '19 16:08 mitchellkrogza

When we use the term "bad", it does not always mean BAD. Often it means something people do not want or may find a nuisance, although the list does include actually bad bots.

mitchellkrogza avatar Aug 16 '19 16:08 mitchellkrogza

Thank you for your answer. I just find it surprising that one would choose not to be indexed by a search engine, so I wondered whether Sogou met some criteria that justify blocking it.

Or it is a deliberate choice by the webmaster not to be indexed in China, though this may affect the site's overall ranking.

If I understand correctly, this is a voluntary choice not to be indexed in a country whose language we do not understand. But for English-language sites, it could make sense to be indexed as broadly as possible on the web, including in China, and I wonder whether that would have a positive impact on the site's ranking.

To conclude, I'm not sure what criteria to use to decide whether it makes sense to allow Sogou's bot, but it seems to me that allowing search engines is surely legitimate.

ZerooCool avatar Aug 17 '19 00:08 ZerooCool

Hi, the WeChat built-in browser uses the user agent MicroMessenger, and this line is in globalblacklist.conf: BrowserMatchNoCase "(?:\b)MicroMessenger(?:\b)" bad_bot. As a result, WeChat users have no access to the site by default if the site admin uses the list without realizing what this line does. Maybe this is a mistake, because there are so many WeChat users and they are not robots.
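
As a rough illustration of why this rule catches human visitors, the sketch below runs the quoted pattern against two sample user agents. Python's `re` stands in for Apache's regex engine (an assumption), and both user-agent strings are illustrative samples written for this example, not taken from real logs. The helper name `is_flagged` is hypothetical.

```python
import re

# Pattern quoted from globalblacklist.conf in the comment above.
PATTERN = r"(?:\b)MicroMessenger(?:\b)"

# Illustrative user agents (assumed samples, not copied from real traffic).
WECHAT_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 "
    "MicroMessenger/8.0.7 NetType/WIFI Language/zh_CN"
)
SAFARI_UA = (
    "Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 "
    "Mobile/15E148 Safari/604.1"
)


def is_flagged(user_agent: str) -> bool:
    """True if the token appears anywhere in the user agent, ignoring case."""
    return re.search(PATTERN, user_agent, re.IGNORECASE) is not None


print("WeChat browser flagged:", is_flagged(WECHAT_UA))
print("Plain Safari flagged:", is_flagged(SAFARI_UA))
```

Because the pattern is unanchored, any browser whose user agent contains the `MicroMessenger` token is tagged, regardless of what else the string contains.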

void285 avatar Aug 15 '21 05:08 void285

Have you tried adding MicroMessenger to your custom bypasses? I will review MicroMessenger when I am back at my desktop; it is a possible FP.

mitchellkrogza avatar Aug 15 '21 05:08 mitchellkrogza

Have you tried adding MicroMessenger to your custom bypasses? I will review MicroMessenger when I am back at my desktop; it is a possible FP.

Yes, I added the line to the custom bypasses, and it works. But I only did this just now, two years after I first started using this list. My site is small and personal, so it doesn't matter much, but there may be others who haven't noticed this problem.

I searched for Chinese search engine rankings and found this: Chinese search engine rank of 2018 Season 4. Surprisingly, the 2nd (Sogou) and 3rd (Haosou/360) search engines are both blocked by this list by default; maybe this is a mistake.

I added these lines to blacklist_custom.conf. I'm not sure whether crawl.sogou.com is a good bot, or which syntax I should use, BrowserMatchNoCase "^(.*?)(\bsogouspider\b)(.*)$" good_bot or BrowserMatchNoCase "(?:\b)sogouspider(?:\b)" good_bot:

BrowserMatchNoCase "(?:\b)360Spider(?:\b)" good_bot
BrowserMatchNoCase "(?:\b)HaosouSpider(?:\b)" good_bot
BrowserMatchNoCase "(?:\b)MicroMessenger(?:\b)" good_bot
BrowserMatchNoCase "(?:\b)Sogou\ web\ spider(?:\b)" good_bot
BrowserMatchNoCase "(?:\b)sogouspider(?:\b)" good_bot
BrowserMatchNoCase "(?:\b)Sosospider(?:\b)" good_bot

BrowserMatchNoCase "(?:\b)crawl.sogou.com(?:\b)" good_bot
BrowserMatchNoCase "^(.*?)(\bSogou\ web\ spider\b)(.*)$" good_bot
BrowserMatchNoCase "^(.*?)(\bsogouspider\b)(.*)$" good_bot
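
On the question of which syntax to use, the sketch below compares the two forms on a few sample strings. It assumes Python's `re` behaves like Apache's regex engine for these patterns and that user agents are single-line strings. The helper names (`anchored`, `unanchored`, `same_result`) and every sample string except the logged Sogou user agent are made up for this illustration.

```python
import re


def anchored(token: str) -> str:
    """Older form used in the thread: ^(.*?)(\\btoken\\b)(.*)$"""
    return rf"^(.*?)(\b{token}\b)(.*)$"


def unanchored(token: str) -> str:
    """Newer form used in the thread: (?:\\b)token(?:\\b)"""
    return rf"(?:\b){token}(?:\b)"


def same_result(token: str, user_agent: str) -> bool:
    """True if both forms agree on whether the user agent matches."""
    a = re.search(anchored(token), user_agent, re.IGNORECASE) is not None
    b = re.search(unanchored(token), user_agent, re.IGNORECASE) is not None
    return a == b


TOKENS = ["sogouspider", r"Sogou\ web\ spider", "360Spider", "MicroMessenger"]
SAMPLES = [
    "Sogou web spider/4.0(+http://www.sogou.com/docs/help/webmasters.htm#07)",
    "sogouspider",
    "Mozilla/5.0 (compatible; 360Spider)",
    "notsogouspiderx",
]

all_agree = all(same_result(t, ua) for t in TOKENS for ua in SAMPLES)
print("Both forms agree on every sample:", all_agree)

# An unescaped "." matches any character, so the rule as written is looser
# than the literal host name; escaping the dots makes it strict.
loose = r"(?:\b)crawl.sogou.com(?:\b)"
strict = r"(?:\b)crawl\.sogou\.com(?:\b)"
print("loose matches 'crawl-sogou-com':", bool(re.search(loose, "crawl-sogou-com", re.I)))
print("strict matches 'crawl-sogou-com':", bool(re.search(strict, "crawl-sogou-com", re.I)))
```

On these samples the two forms give the same answer, since the leading `^(.*?)` and trailing `(.*)$` only consume whatever surrounds the token. Escaping the dots in `crawl.sogou.com` is optional but avoids accidental matches.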

void285 avatar Aug 15 '21 05:08 void285