Scrapy_IPProxyPool
Unable to connect because the target machine actively refused the connection
As the title says. First, two log screenshots:

Below is my entire settings.py configuration. I'm not sure whether my configuration is wrong or the problem is elsewhere. When I run it, it initially crawls some proxy IPs and stores them in the database, and then automatically moves on to my own spider.
```python
BOT_NAME = 'CommoditySpider'
SPIDER_MODULES = ['CommoditySpider.spiders']
NEWSPIDER_MODULE = 'CommoditySpider.spiders'
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.81 Safari/537.36'
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 2
COOKIES_ENABLED = False
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Content-Type': 'text/html;charset=UTF-8',
    'Cache-Control': 'no-cache',
}
ITEM_PIPELINES = {
    'CommoditySpider.aliexpresslines.pipelines.AliExpressPipeline': 300
}
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 3
AUTOTHROTTLE_MAX_DELAY = 60
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
DOWNLOADER_MIDDLEWARES = {
    # Format for each entry:
    # yourproject.myMiddlewares(file name).middleware class
    # Set the User-Agent
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 400,
}
# Use the IP proxy pool by default
if IF_USE_PROXY:
    DOWNLOADER_MIDDLEWARES = {
        # Format for each entry:
        # yourproject.myMiddlewares(file name).middleware class
        # Set the User-Agent
        'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
        'proxyPool.scrapy.RandomUserAgentMiddleware.RandomUserAgentMiddleware': 400,
        # Set the proxy
        'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': None,
        'proxyPool.scrapy.middlewares.ProxyMiddleware': 100,
        # Custom exception-catching middleware
        'proxyPool.scrapy.middlewares.CatchExceptionMiddleware': 105,
        # Custom retry middleware
        'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': None,
        'proxyPool.scrapy.middlewares.RetryMiddleware': 95,
    }
```
It may be that the IPs from some of the proxy sites are no longer valid. I'll spend some time optimizing this soon.
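That diagnosis fits the error message: "the target machine actively refused it" is what a `ConnectionRefusedError` looks like on Windows, i.e. nothing is listening at the proxy's address anymore. One way to weed out dead entries before handing them to the spider is a raw TCP connect check. This is a generic sketch, not part of Scrapy_IPProxyPool itself; the function name and timeout are assumptions.

```python
import socket


def proxy_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds.

    A refused connection ("the target machine actively refused it") or a
    timeout both mean the proxy is dead and should be dropped from the pool.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # ConnectionRefusedError and socket.timeout are OSError subclasses.
        return False
```

Running each stored proxy through a check like this (and deleting the failures) before the spider starts would turn the "actively refused" failures into a smaller, working pool. Note this only proves the port is open, not that the proxy forwards HTTP correctly.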