
KeyError: 'data' when using BaiduImageCrawler

Open tpnam0901 opened this issue 3 years ago • 6 comments

Traceback (most recent call last):
  File "/home/minami/anaconda3/envs/python_function/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "anaconda3/envs/python_function/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "anaconda3/envs/python_function/lib/python3.7/site-packages/icrawler/parser.py", line 104, in worker_exec
    for task in self.parse(response, **kwargs):
  File "anaconda3/envs/python_function/lib/python3.7/site-packages/icrawler/builtin/baidu.py", line 120, in parse
    for item in content['data']:
KeyError: 'data'

Hi there! I met this error when using Baidu. Google and Bing are fine. Is there anything that can fix this?

tpnam0901 avatar Nov 10 '21 13:11 tpnam0901

Hi, I met this error too. Did you solve it? Looking forward to your reply. Thanks.

waduhekx avatar Apr 25 '22 07:04 waduhekx

Hi, I met this error too. Did you solve it? Looking forward to your reply. Thanks.

just do it

[screenshot: adds an 'Accept-Encoding: gzip, deflate, br' header to the request]

chinasilva avatar Sep 22 '22 02:09 chinasilva

@chinasilva This will yield a JSONDecodeError:

Exception in thread parser-001:
Traceback (most recent call last):
  File "*/lib/python3.11/threading.py", line 1038, in _bootstrap_inner
    self.run()
  File "*/lib/python3.11/threading.py", line 975, in run
    self._target(*self._args, **self._kwargs)
  File "*/lib/python3.11/site-packages/icrawler/parser.py", line 104, in worker_exec
    for task in self.parse(response, **kwargs):
  File "*/lib/python3.11/site-packages/icrawler/builtin/baidu.py", line 116, in parse
    content = json.loads(content, strict=False)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "*/lib/python3.11/json/__init__.py", line 359, in loads
    return cls(**kw).decode(s)
           ^^^^^^^^^^^^^^^^^^^
  File "*/lib/python3.11/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "*/lib/python3.11/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

After some experimenting, I found that the following headers work:

headers = {
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'User-Agent':
    ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/88.0.4324.104 Safari/537.36'),
}

liyufan avatar Mar 20 '23 11:03 liyufan

Is the following an example of how to do this?

from icrawler.builtin import BaiduImageCrawler

# folder2 and lookfor are defined elsewhere in my script
baidu_crawler = BaiduImageCrawler(storage={'root_dir': folder2})
baidu_crawler.session.headers = {
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'User-Agent':
    ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
     'AppleWebKit/537.36 (KHTML, like Gecko) '
     'Chrome/88.0.4324.104 Safari/537.36'),
}
baidu_crawler.crawl(keyword=lookfor, offset=0, max_num=1000,
                    min_size=(512, 512), max_size=None)
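
Or, in case replacing the session headers wholesale drops defaults that matter (Accept, Accept-Encoding, Connection), here is an untested sketch that merges the new headers instead; the keyword and folder name are placeholders:

from icrawler.builtin import BaiduImageCrawler

baidu_crawler = BaiduImageCrawler(storage={'root_dir': 'baidu_images'})
# update() merges with the session's existing default headers instead of discarding them
baidu_crawler.session.headers.update({
    'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/88.0.4324.104 Safari/537.36'),
})
baidu_crawler.crawl(keyword='cat', offset=0, max_num=1000,
                    min_size=(512, 512), max_size=None)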

simonmcnair avatar Jul 18 '23 11:07 simonmcnair

Hi, I met this error too. Did you solve it? Looking forward to your reply. Thanks.

just do it

I do not think the answer is adding `Accept-Encoding: gzip, deflate, br`.

Looks like this uses urllib3, and urllib3 can import brotli if you have it installed; I assume that is what adds the "br". In any case, `Accept-Encoding: gzip, deflate, br` just says "I can handle gzip, deflate (zlib), and Brotli responses." If you do not actually have brotli installed, you may get back a garbage (still-compressed) response.

urllib3 Response

Accept-Language may work, since most users prefer a specific language. The default headers, other than User-Agent, are:

'Accept-Encoding': 'gzip, deflate'
'Accept': '*/*'
'Connection': 'keep-alive'
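
If you want to check what your own environment would actually advertise, here is a quick sanity-check sketch; the exact output depends on your requests/urllib3 versions and on whether brotli is installed:

import requests
from urllib3.util import make_headers

# Default headers a plain requests session would send (icrawler uses requests under the hood)
print(requests.utils.default_headers())

# urllib3 builds the Accept-Encoding value; 'br' should only appear if brotli is importable
print(make_headers(accept_encoding=True))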

And this is what my Firefox 121 sends:

Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate, br

I would recommend testing to find the minimal set of headers needed to get past Baidu's blocking.
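
Something along these lines could compare candidate header sets. Note that the endpoint and params below are only my guess at the kind of request icrawler's Baidu feeder makes, so copy the real URL from builtin/baidu.py before drawing conclusions:

import json
import requests

# Hypothetical probe: URL/params approximate the request icrawler's Baidu
# feeder makes -- replace them with the real ones from builtin/baidu.py.
URL = 'https://image.baidu.com/search/acjson'
PARAMS = {'tn': 'resultjson_com', 'ipn': 'rj', 'word': 'cat', 'pn': 0, 'rn': 30}

CANDIDATE_HEADERS = {
    'no extra headers': {},
    'accept-language only': {
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.3,en;q=0.2',
    },
    'accept-language + browser UA': {
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.3,en;q=0.2',
        'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/88.0.4324.104 Safari/537.36'),
    },
}

for name, headers in CANDIDATE_HEADERS.items():
    resp = requests.get(URL, params=PARAMS, headers=headers, timeout=10)
    try:
        content = json.loads(resp.text, strict=False)
    except json.JSONDecodeError:
        print(f'{name}: not JSON (likely blocked or a garbled encoding)')
        continue
    if 'data' in content:
        print(f'{name}: OK, {len(content["data"])} items')
    else:
        print(f'{name}: blocked -> {content.get("message", content)}')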

Patty-OFurniture avatar Jan 04 '24 01:01 Patty-OFurniture

Hi there! I met this error when using Baidu. Google and Bing are fine. Is there anything that can fix this?

The response text I get when I see this error is the following JSON, both in this project and in another one:

{"antiFlag":1,"message":"Forbid spider access","bfe_log_id":"xxxxxx random numbers xxxxxx"}

The correct answer is probably what @liyufan posted: send headers that Baidu would expect from a real browser. This should probably be a configurable option somewhere, but I think Chinese plus English is what Baidu expects. @simonmcnair's example looks correct to me.

Patty-OFurniture avatar Jan 04 '24 02:01 Patty-OFurniture