zhihu-py3

Does Zhihu's server currently limit the number of requests to block crawlers?

Open tzhao0311 opened this issue 9 years ago • 11 comments

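Hello! I'm a beginner. I recently wrote a crawler with the API you developed, but every run stops after crawling a certain number of items. Is this because the Zhihu server limits access? Is there a specific way to work around it?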

tzhao0311 avatar Nov 21 '15 23:11 tzhao0311

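Hmm, please describe exactly what "the crawler stops" looks like. Zhihu does take anti-crawler measures, but those usually make the code raise an error outright rather than simply "stop".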

7sDream avatar Nov 22 '15 07:11 7sDream

I'm currently crawling the URLs of one user's followers. Each time I reach a certain count, an error like the one below appears, and it may differ slightly from run to run. This time it appeared after about 120,000 followers. What is the cause?

Traceback (most recent call last):
  File "/Users/zhaotao/PycharmProjects/zhihu_test_1/user_profile_crawler.py", line 53, in <module>
    for follower in author.followers:
  File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/zhihu_py3-0.3.0-py3.5.egg/zhihu/author.py", line 359, in followers
  File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/zhihu_py3-0.3.0-py3.5.egg/zhihu/author.py", line 405, in _follow_ee_ers
  File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/zhihu_py3-0.3.0-py3.5.egg/zhihu/common.py", line 103, in wrapper
ValueError: Invalid URL

tzhao0311 avatar Nov 22 '15 13:11 tzhao0311

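This looks like a bug rather than an access limit. (That said, 120,000 is rather a lot, so do be careful.) Please give me the profile URL of the user you're crawling, and I'll test it when I have time.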

7sDream avatar Nov 22 '15 13:11 7sDream

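Which anti-crawler strategy does Zhihu use now: limits by IP, by cookies, by request rate, or something else? This is the profile URL of the user I'm crawling: http://www.zhihu.com/people/zhang-jia-wei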

tzhao0311 avatar Nov 22 '15 13:11 tzhao0311

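Requesting too fast gets your IP banned, and your account may be banned as well, so I suggest registering a throwaway account and crawling through a proxy. ZhihuClient has an interface for setting an HTTP proxy. I'll test it tomorrow morning.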
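
For readers unfamiliar with routing traffic through an HTTP proxy, here is a minimal sketch using only the Python standard library. It does not show ZhihuClient's own proxy interface, whose exact signature is not given in this thread. The function name `build_proxy_opener` and the address `http://127.0.0.1:8080` are placeholders invented for illustration.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return a urllib opener that sends HTTP and HTTPS traffic via proxy_url."""
    # The same scheme-to-proxy mapping shape is also what requests accepts
    # in its `proxies` argument.
    proxies = {"http": proxy_url, "https": proxy_url}
    handler = urllib.request.ProxyHandler(proxies)
    return urllib.request.build_opener(handler)


# Hypothetical local proxy address; replace it with a proxy you control.
opener = build_proxy_opener("http://127.0.0.1:8080")
# opener.open("http://www.zhihu.com/") would now go through that proxy.
```

No request is sent when the opener is built, so the sketch can be run without a working proxy.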

7sDream avatar Nov 22 '15 14:11 7sDream

OK, thanks a lot!

tzhao0311 avatar Nov 22 '15 14:11 tzhao0311

This time the following error appeared after a little over 30,000 followers. I don't know whether it's a bug.

Traceback (most recent call last):
  File "/Users/zhaotao/PycharmProjects/zhihu_test_1/user_profile_crawler.py", line 54, in <module>
    for follower in author.followers:
  File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/zhihu_py3-0.3.0-py3.5.egg/zhihu/author.py", line 359, in followers
  File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/zhihu_py3-0.3.0-py3.5.egg/zhihu/author.py", line 392, in _follow_ee_ers
  File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests-2.8.1-py3.5.egg/requests/models.py", line 805, in json
    return complexjson.loads(self.text, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

tzhao0311 avatar Nov 23 '15 11:11 tzhao0311

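It looks like this isn't a bug. Zhihu decided you were requesting too fast and sent back some error responses, so the JSON could not be parsed.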
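
To illustrate why a non-JSON response produces this error, here is a small sketch, assuming the server replied with an empty body or an HTML page instead of JSON. The helper name `try_parse_json` is made up for this example and is not part of zhihu-py3.

```python
import json


def try_parse_json(text):
    """Return (True, data) if text is valid JSON, otherwise (False, None)."""
    try:
        return True, json.loads(text)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError, so this also
        # covers empty bodies and HTML error pages.
        return False, None


ok, data = try_parse_json('{"msg": []}')   # a well-formed JSON body
bad, _ = try_parse_json("<html>Too many requests</html>")  # made-up error page
```

Checking the result before using it lets a crawler slow down or stop cleanly instead of crashing with a `JSONDecodeError`.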

I suggest throttling in code: crawl 1,000 users, then pause for 10 seconds or so. In short, manually lower the request rate for now.
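
The pause-every-1,000 suggestion could be sketched as a small generator wrapper. The name `throttled` and its parameters are assumptions for illustration. The defaults mirror the numbers above, and the `sleep` argument exists only so the pause can be replaced in tests.

```python
import time


def throttled(items, batch_size=1000, pause_seconds=10.0, sleep=time.sleep):
    """Yield items unchanged, pausing after every batch_size items."""
    for count, item in enumerate(items, start=1):
        yield item
        if count % batch_size == 0:
            # Runs when the consumer asks for the item after a full batch.
            sleep(pause_seconds)


# Hypothetical usage with the loop from the tracebacks in this thread:
# for follower in throttled(author.followers):
#     ...
```

Because it wraps any iterable, the crawl loop itself does not need to change. Only the rate at which items are pulled from the underlying iterator is reduced.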

We'll deal with this problem later, for example by adding automatic retries to network access. (That's still a long way off, though.)
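
Until such a mechanism exists in the library, a caller-side retry with exponential backoff might look like the sketch below. This is not zhihu-py3's implementation. `call_with_retry` and its parameters are hypothetical. It retries on `OSError` and `ValueError` because both `requests.exceptions.ConnectionError` and `json.JSONDecodeError` derive from one of those.

```python
import time


def call_with_retry(func, attempts=3, base_delay=1.0,
                    retry_on=(OSError, ValueError), sleep=time.sleep):
    """Call func() and retry with exponential backoff on the given errors."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the last error propagate
            # Wait base_delay, then 2x, then 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)
```

Note that this retries a single callable, such as one page fetch. A generator that has already raised cannot simply be resumed, so the retry has to wrap the individual request rather than the whole loop.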

7sDream avatar Nov 24 '15 09:11 7sDream

Thanks, I'll give it a try and ask again if I run into problems.

tzhao0311 avatar Nov 24 '15 10:11 tzhao0311

Now every run fails after three or four hundred followers with the error below. Could my account have already been restricted by Zhihu?

Traceback (most recent call last):
  File "/Users/zhaotao/PycharmProjects/zhihu_test_1/user_profile_crawler.py", line 55, in <module>
    for follower in author.followers:
  File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/zhihu_py3-0.3.0-py3.5.egg/zhihu/author.py", line 359, in followers
  File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/zhihu_py3-0.3.0-py3.5.egg/zhihu/author.py", line 391, in _follow_ee_ers
  File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests-2.8.1-py3.5.egg/requests/sessions.py", line 511, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests-2.8.1-py3.5.egg/requests/sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests-2.8.1-py3.5.egg/requests/sessions.py", line 576, in send
    r = adapter.send(request, **kwargs)
  File "/System/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/requests-2.8.1-py3.5.egg/requests/adapters.py", line 412, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))

Process finished with exit code 1

tzhao0311 avatar Nov 24 '15 10:11 tzhao0311

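The last line,

requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(54, 'Connection reset by peer'))

means that Zhihu reset the connection. I can't tell whether your account has been restricted, but the error is indeed caused by the site's behavior and not by a problem in the code.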

7sDream avatar Nov 25 '15 12:11 7sDream