cola
A high-level distributed crawling framework.
When crawling user info, the crawler currently cannot tell whether an account is an enterprise account; if it is, the info comes back empty. It is probably missing an `if` check or something similar. Sometimes it also fails outright: get 1908349515 url: http://weibo.com/1908349515/info Error when handle bundle: 1908349515, url: http://weibo.com/1908349515/info ValidationError (WeiboUser:54888bf6c95f801b60bce315) (site.Invalid URL: http://w eibo.com/376765750 http://weibo.com/linuxde: ['info']) Traceback (most recent call last): File "D:\cola\cola\job\executor.py", line...
When crawling Weibo, line 176 of parsers raises an error and I don't know why: `mblog.created = parse(div.select('a.S_link2.WB_time')[0]['title'])` The error output is below: D:\cola\contrib\weibo>init.py D:\cola\cola\core\opener.py:108: UserWarning: gzip transfer encoding is experimental! self.browser.set_handle_gzip(True) start to process priority: 0 process bundle from priority 0 get 3211200050 url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=1&uid=3211200050&end_id=3786010796038435&_t=0&_k=1418233717575000&__rnd=1418233835289&pagebar=0&max_id=3751405376185938&page=1 D:\cola\cola\core\opener.py:108:...
Error message: Error when handle bundle: 1644564144, url: http://weibo.com/aj/mblog/mbloglist?count=15&pre_page=2&uid=1644564144&end_id=3778591994786968&_t=0&_k=1416407703559230&__rnd=1416407854641&pagebar=1&max_id=3768747372034188&page=2 'NoneType' object has no attribute 'text' Traceback (most recent call last): File "/home/kqc/github/cola/cola/job/executor.py", line 504, in _parse_with_process_exception res = self._parse(parser_cls, options, bundle,...
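An error of the form `'NoneType' object has no attribute 'text'` generally means that some lookup returned `None` (for example, an element missing from the fetched page) before `.text` was read. The traceback here is truncated, so the exact lookup is unknown. The following is only a minimal sketch of a guard, assuming that is the cause; `text_or_default` is a hypothetical helper and not part of cola, and `SimpleNamespace` merely stands in for a parsed element.

```python
from types import SimpleNamespace


def text_or_default(node, default=""):
    """Return node.text, or `default` when the node is missing (None)."""
    if node is None:
        return default
    return node.text


# Stand-ins for parsed elements; a real HTML parser returns its own node type.
found = SimpleNamespace(text="hello")
missing = None

found_text = text_or_default(found)
missing_text = text_or_default(missing, "n/a")
```

A guard like this keeps one malformed page from aborting the whole bundle, at the cost of storing a placeholder value that later code has to recognize.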
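I crawl Sina Weibo content by simulating a login with selenium to obtain the cookie, which feels rather inelegant. What approach do you use? It would be even better if the code were commented.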
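Hello, and thank you for providing this framework. It helps a lot. I noticed that you set `instances` for the Weibo crawler to 2, and because of

```python
# cola/worker/loader.py
if master is None:
    with StandaloneWorkerJobLoader(job, root, force=force) as job_loader:
        job_loader.run()
```

only 2 threads in total are crawling Weibo. When I built a similar crawler, I triggered Sina's anti-crawler mechanism, so that every login required a CAPTCHA. The cause was probably too many concurrent crawling threads (16). So I would like to ask how you arrived at this thread count.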
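The question above is about capping the number of concurrent crawl threads. As an illustration only, and not cola's own mechanism, the sketch below caps concurrency with the standard library's `ThreadPoolExecutor` and records the peak number of simultaneous workers. `fetch`, `MAX_WORKERS`, and the placeholder IDs are assumptions made for the example; the value 2 simply mirrors the `instances` setting mentioned above and is not a recommendation.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 2  # assumption: mirrors the `instances` value of 2 mentioned above

_lock = threading.Lock()
_active = 0
peak_active = 0


def fetch(uid):
    """Hypothetical stand-in for one crawl request; tracks peak concurrency."""
    global _active, peak_active
    with _lock:
        _active += 1
        peak_active = max(peak_active, _active)
    time.sleep(0.01)  # simulate network latency
    with _lock:
        _active -= 1
    return uid


uids = [str(n) for n in range(8)]  # placeholder IDs, not real accounts
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    results = list(pool.map(fetch, uids))
```

The pool never runs more than `MAX_WORKERS` calls at once, so `peak_active` stays at or below that cap regardless of how many IDs are queued.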
### Description

In the core logging functionality of the Cola framework (cola/core/logs.py), the LogRecordStreamHandler class directly uses pickle.loads() to deserialize messages received from TCP socket connections without any sanitization, which...
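Python's `pickle` documentation warns that unpickling untrusted data can execute arbitrary code. One mitigation it describes is overriding `Unpickler.find_class` to restrict which globals may be loaded. The sketch below is a hedged illustration of that technique, not a patch for cola: `NoGlobalsUnpickler` and `safe_loads` are hypothetical names, and it assumes log records arrive as a dict of plain built-in types, which is what the standard library's `SocketHandler` sends.

```python
import io
import logging
import pickle


class NoGlobalsUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any class or function reference."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(
            "global '%s.%s' is forbidden" % (module, name))


def safe_loads(data):
    """Deserialize `data`, allowing only built-in containers and scalars."""
    return NoGlobalsUnpickler(io.BytesIO(data)).load()


# A payload shaped like a log record's attribute dict: plain types only.
payload = pickle.dumps(
    {"name": "cola", "levelno": logging.INFO, "msg": "started"})
record = logging.makeLogRecord(safe_loads(payload))

# A payload that references a global (here the built-in `print`) is rejected.
try:
    safe_loads(pickle.dumps(print))
    rejected = False
except pickle.UnpicklingError:
    rejected = True
```

Restricting globals narrows the attack surface but does not authenticate the sender. Switching the wire format to something like JSON, or signing messages, are alternatives worth weighing.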