JobSpiders: How do you run the spiders?

Open Chauncey2 opened this issue 5 years ago • 4 comments

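I see the author has written many spiders inside a single Scrapy project. How do you run them: do you invoke the spiders one by one, or run several at once asynchronously? If the latter, how do the pipelines in pipelines.py cope with pages whose structure differs from spider to spider? Every website is structured differently, after all. How did you implement it, by adding a marker field?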

May 09 '19 14:05 Chauncey2

Isn't there a main file?

May 09 '19 14:05 wqh0109663

I'll try to set up running several spiders asynchronously tomorrow when I have time. For now it runs one at a time.

May 09 '19 14:05 wqh0109663

main.py

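    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    # Build one process from the project's settings.py
    process = CrawlerProcess(get_project_settings())

    # Schedule each spider by name; they share one Twisted reactor
    process.crawl('job_go')
    process.crawl('job_python')
    process.crawl('job_ai')
    process.crawl('job_arithmetic')
    process.crawl('job_bigdata')

    # Blocks until every scheduled spider has finished
    process.start()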

Written like this, it seems to be able to run several spiders at once.

May 10 '19 03:05 jinyue806

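In that case, does the data each spider extracts from its own site have to be wrapped in the same item class and then yielded to the pipeline for processing? Otherwise, given that each spider's pages have a different structure and the scraped fields differ, how do you make sure the pipeline doesn't raise errors when it processes them?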

May 12 '19 02:05 Chauncey2
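
One possible way to handle the question above, sketched here as an illustration rather than as what this repository actually does: the items do not all have to share one class, because a pipeline can branch on the item's type (or on `spider.name`) inside `process_item`. The names `JobItem`, `CompanyItem`, and `DispatchPipeline` are hypothetical, and plain dataclasses stand in for `scrapy.Item` subclasses so that the sketch runs without Scrapy installed.

```python
from dataclasses import dataclass, asdict
from types import SimpleNamespace


# Hypothetical item types, one per page layout. In a real project these
# would typically be scrapy.Item subclasses declared in items.py.
@dataclass
class JobItem:
    title: str
    salary: str


@dataclass
class CompanyItem:
    name: str
    city: str


class DispatchPipeline:
    """Route each item to a handler chosen by its class."""

    def __init__(self):
        # Stand-in for a database: rows grouped by a made-up table name.
        self.rows = {"jobs": [], "companies": []}

    def process_item(self, item, spider):
        # Scrapy calls process_item(item, spider) for every yielded item.
        if isinstance(item, JobItem):
            self.rows["jobs"].append({**asdict(item), "source": spider.name})
        elif isinstance(item, CompanyItem):
            self.rows["companies"].append({**asdict(item), "source": spider.name})
        # Unrecognized item types pass through untouched.
        return item


# Minimal demonstration with a stand-in for a spider object.
demo_pipeline = DispatchPipeline()
demo_spider = SimpleNamespace(name="job_python")
demo_pipeline.process_item(JobItem("Python developer", "15k-25k"), demo_spider)
```

Recording `spider.name` in each row plays the role of the "marker field" mentioned earlier in the thread. Another option, if each site needs entirely separate processing, is to set `ITEM_PIPELINES` in a spider's `custom_settings` so that only that spider's pipeline runs for it.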
