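bdwenku-spider: cannot extract the URLs from the page source for doc-format documents

Cannot extract the URLs from the page source for doc-format documents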

Open Guus1115 opened this issue 4 years ago • 2 comments

	# Extract the data URLs from the page source in bulk
	all_addr = re.findall(r'wkbos\.bdimg\.com.*?json.*?expire.*?\}',source_html)

This line of code returns no matches. Can anyone help?

Jun 18 '20 05:06 Guus1115

Solved. The code needs a small change:

	# Extract the data URLs from the page source in bulk
	all_addr = re.findall(r'wkbjcloudbos\.bdimg\.com.*?json.*?\}',source_html)

Jun 18 '20 05:06 Guus1115
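
The fix above swaps the host name in the pattern from `wkbos.bdimg.com` to `wkbjcloudbos.bdimg.com`. As a minimal sketch, a single pattern could accept either host so the script keeps working on pages that still use the old one. The helper name `extract_data_urls` and the `sample` string are hypothetical and made up for illustration; the real page source may look different.

```python
import re

# Accept both host names seen in this thread:
#   wkbos.bdimg.com         (original pattern)
#   wkbjcloudbos.bdimg.com  (pattern from the fix)
DATA_URL_RE = re.compile(r'wk(?:bjcloud)?bos\.bdimg\.com.*?json.*?\}')


def extract_data_urls(source_html):
    """Return every substring of source_html that matches the data-URL pattern."""
    return DATA_URL_RE.findall(source_html)


# Made-up fragment for illustration only, not real page source.
sample = (
    'x{"u":"https://wkbos.bdimg.com/a/0.json?expire=1"}'
    'y{"u":"https://wkbjcloudbos.bdimg.com/b/1.json?sign=2"}'
)

for url in extract_data_urls(sample):
    print(url)
```

If the list comes back empty on a real page, printing a slice of `source_html` around `bdimg.com` is a quick way to check which host name the page currently uses.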

After I enter the URL I get the following error. What does it mean?

	Enter the URL of the resource: https://wenku.baidu.com/view/a16dfa456e85ec3a87c24028915f804d2b1687f4.html
	The URL you entered is invalid, please enter it again!
	Traceback (most recent call last):
	  File "BDWK.py", line 235, in main
	  File "BDWK.py", line 17, in __init__
	  File "BDWK.py", line 30, in get_doc_type_and_title
	UnicodeDecodeError: 'gbk' codec can't decode byte 0xaf in position 424: illegal multibyte sequence

	During handling of the above exception, another exception occurred:

	Traceback (most recent call last):
	  File "BDWK.py", line 271, in <module>
	  File "BDWK.py", line 238, in main
	AttributeError: module 'os' has no attribute 'exit'
	[8072] Failed to execute script BDWK

Aug 18 '20 07:08 chuyingithub
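
The traceback reports two separate problems. The first is a `UnicodeDecodeError`: some bytes were decoded as `gbk` but are not valid GBK, which often means the content is in another encoding such as UTF-8. The second is certain: Python's `os` module has no `exit` function (`sys.exit` and `os._exit` do exist), so the error handler itself crashes. The following is a minimal sketch of how both could be handled. The source of `BDWK.py` is not shown in this thread, so `decode_page` and `abort` are hypothetical helpers, and the list of encodings to try is an assumption.

```python
import sys


def decode_page(raw, encodings=("utf-8", "gbk")):
    """Decode raw bytes by trying each encoding in turn (the order is an assumption)."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going with replacement characters instead of crashing.
    return raw.decode(encodings[0], errors="replace")


def abort(message, code=1):
    """Print a message and stop the script; sys.exit raises SystemExit."""
    print(message, file=sys.stderr)
    sys.exit(code)  # os.exit does not exist, sys.exit does


if __name__ == "__main__":
    print(decode_page("文库".encode("utf-8")))
    print(decode_page("文库".encode("gbk")))
```

Trying UTF-8 first is a deliberate choice here: text encoded as GBK is usually rejected by the UTF-8 decoder, so the fallback is reached, while the reverse order can silently produce wrong characters.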
