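p92 joke scraping: the regex does not seem to match everything on the first page. It only matched 17 items, but the Runoob online tool (菜鸟工具) shows the same regex matching 20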

Open Mathhub6 opened this issue 1 year ago • 1 comments

https://xiaohua.zol.com.cn/baoxiaonannv/1.html

Code I ran:

# Import modules
import logging

# Pattern matching
import re

# Timing control
import time

# HTTP requests
import requests

# Suppress warnings
logging.captureWarnings(True)

# Request URL template and request headers
url = "https://xiaohua.zol.com.cn/baoxiaonannv/%d.html"
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36",
}


# Regular expression
pattern = re.compile(r'<div class="summary-text">(.*?)</div>')


duanzi = url % (1)
print(duanzi)
requests.packages.urllib3.disable_warnings()
# Fetch the page source; verify=False skips certificate verification
response = requests.get(url=duanzi, headers=header, verify=False, timeout=10).text
# Regex matching
item = pattern.findall(response, re.S)
time.sleep(2)

response
# print(item)


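The regex <div class="summary-text">(.*?)</div> should, in principle, match all 20 items, so why were these 3 not matched? re.S seems to cover \n, but not the tab character \t. Is that the problem? If so, how should the regex be changed so that \t is matched as well?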
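One possible explanation, offered as a sketch rather than a confirmed diagnosis: on a compiled pattern, the signature is `findall(string, pos, endpos)`, so the second argument is a start position, not a flags value. Passing `re.S` there is read as `pos=16`, and DOTALL is never switched on. Note also that `.` matches `\t` even without any flag; it is only the newline that `.` skips unless `re.S` is given to `re.compile`. The HTML below is made up for illustration and is not the real page content.

```python
import re

# Hypothetical HTML standing in for the real page (assumption, not scraped data).
# The second item contains line breaks inside the div; the third contains a tab.
html = (
    "<!DOCTYPE html><html><body>\n"
    '<div class="summary-text">one-line joke</div>\n'
    '<div class="summary-text">\n\t<p>multi-line joke</p>\n</div>\n'
    '<div class="summary-text">\tjoke with a tab</div>\n'
    "</body></html>\n"
)

regex = r'<div class="summary-text">(.*?)</div>'

# Same call shape as in the code above: re.S ends up in the `pos` slot,
# so the search merely starts at index 16 and "." still stops at "\n".
without_flag = re.compile(regex)
missed = without_flag.findall(html, re.S)

# Flag passed at compile time: "." now matches newlines as well.
with_flag = re.compile(regex, re.S)
found = with_flag.findall(html)

print(len(missed), len(found))
```

If the three unmatched jokes on the real page are the ones whose text spans several lines, moving `re.S` into `re.compile(...)` (or calling `re.findall(regex, response, re.S)`) would be the change to try first. An online tester with a "dot matches newline" option enabled would then explain the count of 20.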


Mathhub6, Jan 15 '24 09:01