learn_python3_spider

The second spider program throws an error

Open lovevantt opened this issue 5 years ago • 7 comments

The error is:

    Traceback (most recent call last):
      File "D:/coding/Python/PyCharm/test1/test2.py", line 127, in <module>
        main(i)
      File "D:/coding/Python/PyCharm/test1/test2.py", line 119, in main
        soup = BeautifulSoup(html, 'lxml')
      File "C:\Programs\Python\Python38-32\lib\site-packages\bs4\__init__.py", line 287, in __init__
        elif len(markup) <= 256 and (
    TypeError: object of type 'NoneType' has no len()

lovevantt avatar Jan 14 '20 03:01 lovevantt

I'm using the code as provided: douban_top_250_books.py

lovevantt avatar Jan 14 '20 03:01 lovevantt

Same problem here. Did you find a solution?

Ryyy233 avatar Mar 11 '20 09:03 Ryyy233

Sorry, I haven't had time to look into it yet.


zhangxy12138 avatar Mar 11 '20 12:03 zhangxy12138

Cause: the request to Douban failed, so request_douban returned None. (Check your own code: if, like the author's version, it returns the string 'None', test with != 'None' instead.) Fix:

    html = request_douban(url)
    if html is not None:
        soup = BeautifulSoup(html, 'lxml')
        save_to_excel(soup)
    else:
        print('request_douban return None')
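
The `is not None` guard matters because BeautifulSoup calls `len()` on whatever markup it is given, which is exactly the `len(markup)` frame in the traceback. A minimal reproduction of the reported error, with no bs4 involved (the `html` variable stands in for the failed response):

```python
# Reproduce the reported TypeError without bs4:
# BeautifulSoup(html, 'lxml') eventually runs len(markup),
# and len(None) is the real error.
html = None  # what request_douban returns when the request fails

try:
    len(html)  # the same call bs4 makes on its markup argument
except TypeError as e:
    print(e)  # object of type 'NoneType' has no len()
```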

panhainan avatar Apr 11 '20 03:04 panhainan

But this still doesn't actually solve the problem, because the request itself is failing. The likely cause is that Douban identified our requests as coming from a crawler and blocked them. Adding headers to mimic a browser request, rather than a crawler, gets around this. Rewrite the request_douban method:

    import requests

    def request_douban(url):
        headers = {
            # pretend to be a browser
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36',
        }
        try:
            response = requests.get(url, headers=headers)
            if response.status_code == 200:
                return response.text
            return None
        except requests.RequestException:
            return None
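
If `requests` is not available, the same User-Agent trick can be sketched with only the standard library (a sketch under that assumption, not the author's original code; note that `urllib` raises `URLError`/`HTTPError` on failure instead of returning a response object):

```python
# Standard-library sketch of the header trick (an assumption,
# not the author's original code): attach a browser User-Agent
# so the server does not see Python's default client string.
import urllib.error
import urllib.request

UA = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
      'AppleWebKit/537.36 (KHTML, like Gecko) '
      'Chrome/77.0.3865.120 Safari/537.36')

def request_douban(url):
    # the Request object carries the spoofed browser header
    req = urllib.request.Request(url, headers={'User-Agent': UA})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.read().decode('utf-8')
    except urllib.error.URLError:
        # covers HTTPError too; the caller's None check handles this
        return None

# No network needed to see that the header really is attached
# (Request normalizes header names, so the key becomes 'User-agent'):
req = urllib.request.Request('https://book.douban.com/top250',
                             headers={'User-Agent': UA})
print(req.get_header('User-agent') == UA)  # True
```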

panhainan avatar Apr 11 '20 03:04 panhainan

Just adding a request header fixes it.

Wakerrd avatar Jun 03 '20 08:06 Wakerrd

Add this; otherwise your request identifies itself as python-requests and gets blocked straight away:

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36 OPR/66.0.3515.115'
    }

McChickenNuggets avatar Apr 01 '21 01:04 McChickenNuggets
