scrapy_for_zh_wiki
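Crawls Chinese Wikipedia using a Scrapy-based hierarchical priority queue method, and automatically extracts structured and semi-structured data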
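A suggestion
Judging by the code logic, the crawl is DFS rather than BFS. Using a global queue also causes the same page to be visited multiple times. After changing this, the speed went from 0.3 pages/s to 8 pages/s.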
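The suggestion above can be sketched as follows. This is only a minimal illustration, not the repository's code: `bfs_order`, `toy_links`, and `TOY_GRAPH` are hypothetical names, and a small in-memory link graph stands in for real wiki pages. A FIFO queue gives breadth-first order, and a `seen` set keeps each page from being queued more than once.

```python
from collections import deque


def bfs_order(start, get_links, max_pages=100):
    """Return pages in breadth-first order, visiting each page at most once."""
    seen = {start}              # every page that has ever been queued
    frontier = deque([start])   # FIFO queue of pages waiting to be visited
    order = []
    while frontier and len(order) < max_pages:
        page = frontier.popleft()   # popleft() gives BFS; pop() would give DFS
        order.append(page)
        for link in get_links(page):
            if link not in seen:    # skip pages that were already queued
                seen.add(link)
                frontier.append(link)
    return order


# Toy link graph standing in for wiki pages (hypothetical data).
TOY_GRAPH = {
    "A": ["B", "C"],
    "B": ["D", "A"],
    "C": ["D"],
    "D": ["A"],
}


def toy_links(page):
    """Return the outgoing links of a page in the toy graph."""
    return TOY_GRAPH.get(page, [])
```

Calling `bfs_order("A", toy_links)` visits each of the four pages exactly once even though the graph contains cycles and several pages link to the same target.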
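- Error: module 'queue' has no attribute 'put'. Note that queue.py can be renamed to queue1.py.
- There are three markdown files. Before starting the crawler, follow them to create the required directories and files.
- origin is misspelled in some places in the code, so check the directory names.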
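One plausible reading of the first item, offered as an assumption rather than a confirmed diagnosis: the project's local queue.py shares its name with Python's standard-library `queue` module, so `import queue` may resolve to the standard-library one, which has no module-level `put`. The sketch below only demonstrates that behaviour of the standard library. The helper names `module_has_put` and `instance_put_get` are hypothetical.

```python
import queue  # resolves to the standard-library module when no local queue.py wins


def module_has_put():
    """Check whether the imported `queue` module exposes a top-level `put`."""
    return hasattr(queue, "put")


def instance_put_get(item):
    """Round-trip an item through a queue.Queue instance, which does have put()."""
    q = queue.Queue()
    q.put(item)
    return q.get()
```

Renaming the local file, as the item suggests, removes the name collision, so imports of the project's own module can no longer pick up the standard-library one by accident.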
``` 2021-02-26 12:11:22 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: counselor) 2021-02-26 12:11:22 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.3 (default,...
C:\Users\14352\.conda\envs\reptile\python.exe F:\spider\wiki_real\wiki爬取教程\counselor\main.py 2023-10-17 20:35:17 [scrapy.utils.log] INFO: Scrapy 2.11.0 started (bot: counselor) 2023-10-17 20:35:17 [scrapy.utils.log] INFO: Versions: lxml 4.9.3.0, libxml2 2.10.3, cssselect 1.2.0, parsel 1.8.1, w3lib 2.1.2, Twisted 22.10.0, Python 3.10.13...
2023-05-24 15:44:49 [scrapy.core.scraper] ERROR: Error downloading Traceback (most recent call last): File "/Users/luhongyang/opt/anaconda3/python.app/Contents/lib/python3.9/site-packages/tldextract/cache.py", line 190, in run_and_cache result = self.get(namespace=namespace, key=key_args) File "/Users/luhongyang/opt/anaconda3/python.app/Contents/lib/python3.9/site-packages/tldextract/cache.py", line 93, in get raise KeyError("namespace: "...