icrawler
KeyError: 'data' when using BaiduImageCrawler
Traceback (most recent call last):
  File "/home/minami/anaconda3/envs/python_function/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "anaconda3/envs/python_function/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "anaconda3/envs/python_function/lib/python3.7/site-packages/icrawler/parser.py", line 104, in worker_exec
    for task in self.parse(response, **kwargs):
  File "anaconda3/envs/python_function/lib/python3.7/site-packages/icrawler/builtin/baidu.py", line 120, in parse
    for item in content['data']:
KeyError: 'data'
Hi there! I ran into this error when using the Baidu crawler. The Google and Bing crawlers work fine. Is there a fix for this?
Hi, I ran into this error too. Did you solve it? Looking forward to your reply. Thanks.
just do it
@chinasilva This will yield JSONDecodeError:
Exception in thread parser-001:
Traceback (most recent call last):
  File "*/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "*/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "*/lib/python3.11/site-packages/icrawler/parser.py", line 104, in worker_exec
    for task in self.parse(response, **kwargs):
  File "*/lib/python3.11/site-packages/icrawler/builtin/baidu.py", line 116, in parse
    content = json.loads(content, strict=False)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "*/lib/python3.11/json/__init__.py", line 359, in loads
    return cls(**kw).decode(s)
           ^^^^^^^^^^^^^^^^^^^
  File "*/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "*/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
After some testing, I found that the following headers work:
    headers = {
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        # Adjacent string literals are joined as-is, so each fragment
        # ends with a space to keep the User-Agent tokens separated.
        'User-Agent':
            ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
             'AppleWebKit/537.36 (KHTML, like Gecko) '
             'Chrome/88.0.4324.104 Safari/537.36'),
    }
Is the following an example of how to do this?
    from icrawler.builtin import BaiduImageCrawler

    # folder2 (output directory) and lookfor (search keyword) are defined elsewhere.
    baidu_crawler = BaiduImageCrawler(storage={'root_dir': folder2})
    # Replace the session's headers with browser-like ones.
    baidu_crawler.session.headers = {
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'User-Agent':
            ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
             'AppleWebKit/537.36 (KHTML, like Gecko) '
             'Chrome/88.0.4324.104 Safari/537.36'),
    }
    baidu_crawler.crawl(keyword=lookfor, offset=0, max_num=1000,
                        min_size=(512, 512), max_size=None)
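A minimal sketch of a variant of the snippet above: instead of replacing the header dict wholesale, `update()` merges the browser-like headers into whatever the session already sends. It assumes only that the crawler exposes a dict-like `session.headers`, as the snippet above does. `apply_browser_headers` is a hypothetical helper, and the `SimpleNamespace` object is a stand-in for a real crawler so the example runs without icrawler installed.

```python
from types import SimpleNamespace

# Header values taken from the thread. The spaces between the User-Agent
# fragments are an assumption about the intended string.
BROWSER_HEADERS = {
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'User-Agent': (
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
        'AppleWebKit/537.36 (KHTML, like Gecko) '
        'Chrome/88.0.4324.104 Safari/537.36'
    ),
}


def apply_browser_headers(crawler, headers=BROWSER_HEADERS):
    """Merge headers into crawler.session.headers, keeping unrelated defaults."""
    crawler.session.headers.update(headers)
    return crawler


# Stand-in for a crawler (hypothetical): only session.headers is modelled.
fake_crawler = SimpleNamespace(
    session=SimpleNamespace(
        headers={'Accept': '*/*', 'User-Agent': 'python-requests'}
    )
)
apply_browser_headers(fake_crawler)
```

With a real crawler the call would presumably be `apply_browser_headers(baidu_crawler)` before `baidu_crawler.crawl(...)`. Merging keeps defaults such as `Accept` in place, while plain assignment drops them.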
I do not think the answer is adding Accept-Encoding: gzip, deflate, br
It looks like this uses urllib3, which can decode Brotli if the brotli package is installed. I assume that is also what would add the "br". Sending Accept-Encoding: gzip, deflate, br tells the server that you can handle gzip, deflate (zlib) and Brotli responses. If you send it without brotli installed, you may get a response you cannot decode, which looks like garbage.
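As a rough sketch of that point, the header value can be built so that "br" is only advertised when a Brotli decoder is importable. The helper names are hypothetical, and checking for the `brotli` and `brotlicffi` packages is an assumption about which decoders the HTTP stack would pick up.

```python
import importlib.util


def brotli_available():
    """Return True if a Brotli decoder package appears to be installed."""
    return any(
        importlib.util.find_spec(name) is not None
        for name in ('brotli', 'brotlicffi')  # assumed package names
    )


def accept_encoding(has_brotli=None):
    """Build an Accept-Encoding value listing only encodings we can decode."""
    if has_brotli is None:
        has_brotli = brotli_available()
    encodings = ['gzip', 'deflate']
    if has_brotli:
        encodings.append('br')
    return ', '.join(encodings)
```

Calling `accept_encoding()` with no argument checks the local environment; passing `True` or `False` forces the result, which is handy for testing.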
Accept-Language may work, since most users prefer a specific language. Default headers, other than User-Agent:
'Accept-Encoding': 'gzip, deflate'
'Accept': '*/*'
'Connection': 'keep-alive'
And this is what my Firefox 121 sends:
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br
I would recommend testing to find the minimal set of headers that gets past the Baidu block.
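One way to run that kind of test, as a sketch only: try header subsets from smallest to largest and stop at the first size that works. `minimal_header_sets` and the `works` callback are hypothetical. In practice `works` would send a request with the given headers and report whether Baidu returned usable data; here a fake probe stands in so the example runs offline.

```python
from itertools import combinations


def minimal_header_sets(candidates, works):
    """Return every smallest subset of candidates for which works(headers) is true."""
    names = sorted(candidates)
    for size in range(len(names) + 1):
        found = []
        for subset in combinations(names, size):
            headers = {name: candidates[name] for name in subset}
            if works(headers):
                found.append(headers)
        if found:
            return found
    return []


def fake_probe(headers):
    """Illustrative stand-in: pretend the server needs these two headers."""
    return 'Accept-Language' in headers and 'User-Agent' in headers


candidate_headers = {
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.2',
    'Accept-Encoding': 'gzip, deflate',
    'User-Agent': 'Mozilla/5.0',
}
smallest = minimal_header_sets(candidate_headers, fake_probe)
```

The search is exhaustive, so it is only practical for a handful of candidate headers, and a real probe should pause between requests to avoid triggering further blocking.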
When I hit this error, the response text is the following JSON, both in this project and in another one:
{"antiFlag":1,"message":"Forbid spider access","bfe_log_id":"xxxxxx random numbers xxxxxx"}
The correct answer is probably what @liyufan posted: send headers that Baidu would expect from a real person. This should be configurable somewhere, but I think Chinese and English are what Baidu expects. What @simonmcnair posted looks correct to me.
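To make both failure modes in this thread easier to diagnose, here is a sketch of a defensive parse step. It is not icrawler's code: `extract_items` and `BaiduBlockedError` are hypothetical names, and the only assumptions about the response are the `antiFlag`, `message` and `data` keys quoted in this thread.

```python
import json


class BaiduBlockedError(RuntimeError):
    """Raised when the response is not a usable result payload."""


def extract_items(text):
    """Return the list under 'data', or raise a descriptive error."""
    try:
        content = json.loads(text, strict=False)
    except json.JSONDecodeError as exc:
        # Covers the "Expecting value: line 1 column 1" case above.
        raise BaiduBlockedError(f'response is not JSON: {exc}') from exc
    if not isinstance(content, dict):
        raise BaiduBlockedError('unexpected JSON payload type')
    if content.get('antiFlag'):
        # Covers the "Forbid spider access" payload that has no 'data' key.
        raise BaiduBlockedError(content.get('message', 'blocked by server'))
    return content.get('data', [])
```

Raising a named error instead of letting `KeyError: 'data'` surface makes it clear that the request was rejected, not that the parser is broken.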