imuncle.github.io
Python crawler: the urllib HTTP Error 418
Learning a bit of Python web scraping in my spare time, I defined a function to fetch a page's HTML content:
import urllib.request
import urllib.error

def askURL(url):
    request = urllib.request.Request(url)  # build the request
    try:
        response = urllib.request.urlopen(request)  # send it and get the response
        html = response.read()  # read the page content
        # print(html)
        return html
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
Running it printed 418, meaning the server responded with HTTP Error 418.
Seeing that error, I suspected an anti-crawler mechanism: the site was blocking direct script access, so the request probably had to imitate a browser.
So I opened a browser, pressed F12, visited any page, selected a request in the Network panel, and scrolled down in its Headers to find the User-Agent field, which identifies the browser making the request.
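As a quick sanity check (my own sketch, not from the original post), you can print the User-Agent that urllib's default opener advertises; this "Python-urllib" signature is exactly what anti-crawler rules can match on:

```python
import urllib.request

# The default opener announces itself as "Python-urllib/<version>",
# which makes a bare urlopen() call easy for servers to detect.
opener = urllib.request.build_opener()
print(opener.addheaders)  # e.g. [('User-agent', 'Python-urllib/3.11')]
```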

So I modified the code as follows:
import urllib.request
import urllib.error

def askURL(url):
    # pretend to be a regular Chrome browser
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
    request = urllib.request.Request(url, headers=headers)  # build the request
    try:
        response = urllib.request.urlopen(request)  # send it and get the response
        html = response.read()  # read the page content
        # print(html)
        return html
    except urllib.error.URLError as e:
        if hasattr(e, "code"):
            print(e.code)
        if hasattr(e, "reason"):
            print(e.reason)
With the User-Agent header set, the function fetches the page content successfully!
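You can verify offline that the spoofed header is actually attached to the request before it is sent (a sketch; the URL here is a placeholder, not from the post). Note that urllib normalizes header names to capitalized form, so the stored key is "User-agent":

```python
import urllib.request

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
# placeholder URL for illustration only
req = urllib.request.Request("https://example.com", headers=headers)

# urllib stores header names capitalized, so query with "User-agent"
print(req.get_header("User-agent"))
```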