
Python crawler: HTTP Error 418 with urllib


While idly teaching myself Python web scraping, I defined a function to fetch a page's HTML content:

import urllib.request
import urllib.error

def askURL(url):
    request = urllib.request.Request(url)  # build the request
    try:
        response = urllib.request.urlopen(request)  # send it and get the response
        html = response.read()  # read the raw page content (bytes)
        # print(html)
        return html
    except urllib.error.URLError as e:
        if hasattr(e, "code"):  # HTTPError carries a status code
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
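For context (this note and snippet are mine, not from the original post): for non-2xx responses, urlopen raises urllib.error.HTTPError, a subclass of URLError, which is why the handler probes with hasattr(e, "code") — a plain URLError (say, a DNS failure) has only a reason, no status code. A minimal sketch against httpbin.org, a test service that returns whatever status code you put in the URL:

import urllib.request
import urllib.error

try:
    # httpbin.org/status/<code> responds with exactly that status code
    urllib.request.urlopen('https://httpbin.org/status/418')
except urllib.error.HTTPError as e:
    print(e.code, e.reason)  # prints 418 and the server's reason phrase
except urllib.error.URLError as e:
    print(e.reason)          # network-level failure: no status code, only a reason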

Running it, the output was 418, meaning an HTTP Error 418 occurred. 418 is the joke status code "I'm a teapot" from RFC 2324, which some sites repurpose to turn away requests that look like bots.

Seeing this error, I immediately suspected an anti-crawler mechanism: the request most likely has to imitate a browser, because a bare crawl gets blocked.

So I opened a browser, pressed F12, visited some website, selected one of the requests, opened its Headers panel, and scrolled down to find User-Agent, which identifies the browser the request came from.

(Screenshot: the User-Agent entry under Request Headers in the browser's DevTools Network panel)
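The reason a bare urllib request stands out (my explanation, not from the original post): urllib announces itself with a default User-Agent of the form "Python-urllib/3.x", which anti-bot filters match easily. A quick way to see what actually gets sent, using httpbin.org/headers, which echoes the request headers back as JSON:

import json
import urllib.request

# httpbin.org/headers returns a JSON echo of the headers it received
with urllib.request.urlopen('https://httpbin.org/headers') as response:
    print(json.load(response)['headers']['User-Agent'])  # e.g. Python-urllib/3.10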

So I modified the code as follows:

import urllib.request
import urllib.error

def askURL(url):
    # pretend to be a desktop Chrome browser
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
    request = urllib.request.Request(url, headers=headers)  # build the request with browser headers
    try:
        response = urllib.request.urlopen(request)  # send it and get the response
        html = response.read()  # read the raw page content (bytes)
        # print(html)
        return html
    except urllib.error.URLError as e:
        if hasattr(e, "code"):  # HTTPError carries a status code
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)

With that change, the page content is fetched successfully!
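A follow-up sketch, not from the original post: instead of passing headers to every Request, the same User-Agent can be installed once with urllib's build_opener/install_opener, after which every urlopen call in the process sends it (the URL below is just a placeholder):

import urllib.request

opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent',
                      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/67.0.3396.99 Safari/537.36')]
urllib.request.install_opener(opener)

html = urllib.request.urlopen('https://example.com').read()  # placeholder URL; UA now sent automatically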

imuncle · Apr 20 '20