learn_python3_spider

The second spider program throws an error

Open lovevantt opened this issue 5 years ago • 7 comments

The error is:

    Traceback (most recent call last):
      File "D:/coding/Python/PyCharm/test1/test2.py", line 127, in <module>
        main(i)
      File "D:/coding/Python/PyCharm/test1/test2.py", line 119, in main
        soup = BeautifulSoup(html, 'lxml')
      File "C:\Programs\Python\Python38-32\lib\site-packages\bs4\__init__.py", line 287, in __init__
        elif len(markup) <= 256 and (
    TypeError: object of type 'NoneType' has no len()

lovevantt avatar Jan 14 '20 03:01 lovevantt

I'm using the code as provided: douban_top_250_books.py

lovevantt avatar Jan 14 '20 03:01 lovevantt

Same problem here. Did you find a solution?

Ryyy233 avatar Mar 11 '20 09:03 Ryyy233

Sorry, I haven't had time to look into it yet.


zhangxy12138 avatar Mar 11 '20 12:03 zhangxy12138

Cause: the request to Douban failed, so request_douban returned None. (Check your own code: if, like the author's version, it returns the string 'None', test with != 'None' instead.) Fix:

    html = request_douban(url)
    if html is not None:
        soup = BeautifulSoup(html, 'lxml')
        save_to_excel(soup)
    else:
        print('request_douban return None')
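
The `is not None` guard matters because BeautifulSoup calls `len()` on whatever markup it is given, which is exactly the `len(markup)` frame in the traceback. A minimal reproduction of the reported error, with no bs4 involved (the `html` variable stands in for the failed response):

```python
# Reproduce the reported TypeError without bs4:
# BeautifulSoup(html, 'lxml') eventually runs len(markup),
# and len(None) is the real error.
html = None  # what request_douban returns when the request fails

try:
    len(html)  # the same call bs4 makes on its markup argument
except TypeError as e:
    print(e)  # object of type 'NoneType' has no len()
```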

panhainan avatar Apr 11 '20 03:04 panhainan

But this still doesn't actually solve the problem, because the request itself is failing. The likely cause is that Douban identified our requests as coming from a crawler and blocked them. Adding headers to mimic a browser request, rather than a crawler, gets around this. Rewrite the request_douban method:

    import requests

    def request_douban(url):
        headers = {
            # pretend to be a browser
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36',
        }
        try:
            response = requests.get(url, headers=headers)
            if response.status_code == 200:
                return response.text
            return None
        except requests.RequestException:
            return None
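
If `requests` is not available, the same User-Agent trick can be sketched with only the standard library (a sketch under that assumption, not the author's original code; note that `urllib` raises `URLError`/`HTTPError` on failure instead of returning a response object):

```python
# Standard-library sketch of the header trick (an assumption,
# not the author's original code): attach a browser User-Agent
# so the server does not see Python's default client string.
import urllib.error
import urllib.request

UA = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
      'AppleWebKit/537.36 (KHTML, like Gecko) '
      'Chrome/77.0.3865.120 Safari/537.36')

def request_douban(url):
    # the Request object carries the spoofed browser header
    req = urllib.request.Request(url, headers={'User-Agent': UA})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.read().decode('utf-8')
    except urllib.error.URLError:
        # covers HTTPError too; the caller's None check handles this
        return None

# No network needed to see that the header really is attached
# (Request normalizes header names, so the key becomes 'User-agent'):
req = urllib.request.Request('https://book.douban.com/top250',
                             headers={'User-Agent': UA})
print(req.get_header('User-agent') == UA)  # True
```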

panhainan avatar Apr 11 '20 03:04 panhainan

Just adding a request header fixes it.

Wakerrd avatar Jun 03 '20 08:06 Wakerrd

Add this; otherwise your request identifies itself as python-requests and gets blocked straight away:

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36 OPR/66.0.3515.115'
    }

McChickenNuggets avatar Apr 01 '21 01:04 McChickenNuggets
