
Python crawler: HTTP Error 418 with urllib


While idly teaching myself Python web scraping, I defined a function to fetch a page's HTML content:

import urllib.request
import urllib.error

def askURL(url):
    request = urllib.request.Request(url)  # build the request
    try:
        response = urllib.request.urlopen(request)  # send it and get the response
        html = response.read()  # read the raw page content (bytes)
        # print(html)
        return html
    except urllib.error.URLError as e:
        if hasattr(e, "code"):  # HTTPError carries a status code
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
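For context (this note and snippet are mine, not from the original post): for non-2xx responses, urlopen raises urllib.error.HTTPError, a subclass of URLError, which is why the handler probes with hasattr(e, "code") — a plain URLError (say, a DNS failure) has only a reason, no status code. A minimal sketch against httpbin.org, a test service that returns whatever status code you put in the URL:

import urllib.request
import urllib.error

try:
    # httpbin.org/status/<code> responds with exactly that status code
    urllib.request.urlopen('https://httpbin.org/status/418')
except urllib.error.HTTPError as e:
    print(e.code, e.reason)  # prints 418 and the server's reason phrase
except urllib.error.URLError as e:
    print(e.reason)          # network-level failure: no status code, only a reason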

Running it, the output was 418, meaning an HTTP Error 418 occurred. 418 is the joke status code "I'm a teapot" from RFC 2324, which some sites repurpose to turn away requests that look like bots.

Seeing this error, I immediately suspected an anti-crawler mechanism: the request most likely has to imitate a browser, because a bare crawl gets blocked.

So I opened a browser, pressed F12, visited some website, selected one of the requests, opened its Headers panel, and scrolled down to find User-Agent, which identifies the browser the request came from.

(Screenshot: the User-Agent entry under Request Headers in the browser's DevTools Network panel)
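The reason a bare urllib request stands out (my explanation, not from the original post): urllib announces itself with a default User-Agent of the form "Python-urllib/3.x", which anti-bot filters match easily. A quick way to see what actually gets sent, using httpbin.org/headers, which echoes the request headers back as JSON:

import json
import urllib.request

# httpbin.org/headers returns a JSON echo of the headers it received
with urllib.request.urlopen('https://httpbin.org/headers') as response:
    print(json.load(response)['headers']['User-Agent'])  # e.g. Python-urllib/3.10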

So I modified the code as follows:

import urllib.request
import urllib.error

def askURL(url):
    # pretend to be a desktop Chrome browser
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
    request = urllib.request.Request(url, headers=headers)  # build the request with browser headers
    try:
        response = urllib.request.urlopen(request)  # send it and get the response
        html = response.read()  # read the raw page content (bytes)
        # print(html)
        return html
    except urllib.error.URLError as e:
        if hasattr(e, "code"):  # HTTPError carries a status code
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)

With that change, the page content is fetched successfully!
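A follow-up sketch, not from the original post: instead of passing headers to every Request, the same User-Agent can be installed once with urllib's build_opener/install_opener, after which every urlopen call in the process sends it (the URL below is just a placeholder):

import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/67.0.3396.99 Safari/537.36')]
urllib.request.install_opener(opener)

html = urllib.request.urlopen('https://example.com').read()  # placeholder URL; UA now sent automatically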

imuncle · Apr 20 '20