webmagic issues

How to cancel requests that have already been queued

1

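Description: 1. The project currently uses xxl-job (https://github.com/xuxueli/xxl-job) for task management. When a task starts, all target URLs are pushed into the queue at once and spider.run() is called. After a few URLs have been crawled, the results turn out not to be what was wanted, and the current crawl needs to be stopped. In this scenario, how can the spider be shut down manually?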

muchengyang
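One way to picture the scenario above is a crawl loop that checks a stop flag before taking each queued URL, so that a stop request both ends the loop and discards whatever is still waiting. The sketch below uses only the JDK; `StoppableCrawl` and its methods are hypothetical names for illustration, not webmagic code. In webmagic itself, `Spider` has, to my knowledge, a `stop()` method; since `spider.run()` blocks, it would have to be called from another thread, and whether the queued requests are also discarded depends on the Scheduler in use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for a crawler loop; not webmagic code.
public class StoppableCrawl {
    private final Queue<String> pending = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean stopped = new AtomicBoolean(false);
    private final List<String> processed = new ArrayList<>();

    public void enqueue(String url) {
        pending.add(url);
    }

    // Request a stop and drop everything still waiting in the queue.
    public void stop() {
        stopped.set(true);
        pending.clear();
    }

    // Blocking loop: checks the stop flag before taking each URL.
    public void run() {
        String url;
        while (!stopped.get() && (url = pending.poll()) != null) {
            processed.add(url);
            // Simulate noticing unwanted results after the second URL.
            if (processed.size() == 2) {
                stop();
            }
        }
    }

    public List<String> processed() {
        return processed;
    }

    public int pendingCount() {
        return pending.size();
    }

    public static void main(String[] args) {
        StoppableCrawl crawl = new StoppableCrawl();
        for (int i = 1; i <= 5; i++) {
            crawl.enqueue("http://example.com/" + i);
        }
        crawl.run();
        System.out.println(crawl.processed());    // the two URLs handled before the stop
        System.out.println(crawl.pendingCount()); // 0, the rest were discarded
    }
}
```

In a real job the call to `stop()` would come from outside the loop, for example from the scheduler's cancel hook, rather than from inside `run()` as simulated here.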

Crawling an https site fails with javax.net.ssl.SSLHandshakeException: Received fatal alert: handshake_failure

5

**Site being crawled:** https://contests.covers.com/Consensus/TopConsensus/NBA/Overall/2016-11-28 **Error message:** ``` 18-10-13 11:33:24,597 INFO org.springframework.beans.factory.xml.XmlBeanDefinitionReader(XmlBeanDefinitionReader.java:317) ## Loading XML bean definitions from file [/Users/peichenchen/Documents/gitCode/nbaAnalyze/target/classes/spring/applicationContext.xml] trigger seeding of SecureRandom done seeding SecureRandom 18-10-13 11:33:26,125 INFO us.codecraft.webmagic.Spider(Spider.java:306) ## Spider contests.covers.com...

peichenchen

How can I get the XPath of the page structure where a given piece of content is located?

4

For example, I locate an image whose src is xxx.jpg. How do I then get the full structural position of xxx.jpg in the page, something like /html/div/div/img[2]?

cwtree
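A possible approach to the question above is to find the element first and then walk up its ancestors, recording each tag name plus its 1-based position among same-named siblings. The sketch below is a minimal illustration using only the JDK's DOM API, so it assumes the input is well-formed XML; real HTML would need a lenient parser. `XPathOfNode`, `xpathOf`, and `findByAttribute` are hypothetical helper names, not webmagic APIs.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XPathOfNode {

    // Build an absolute XPath for an element by walking up to the root.
    public static String xpathOf(Element element) {
        StringBuilder path = new StringBuilder();
        Node current = element;
        while (current != null && current.getNodeType() == Node.ELEMENT_NODE) {
            String name = current.getNodeName();
            int index = 1;     // 1-based position among same-named siblings
            int sameNamed = 1; // number of siblings sharing this tag name
            for (Node s = current.getPreviousSibling(); s != null; s = s.getPreviousSibling()) {
                if (s.getNodeType() == Node.ELEMENT_NODE && s.getNodeName().equals(name)) {
                    index++;
                    sameNamed++;
                }
            }
            for (Node s = current.getNextSibling(); s != null; s = s.getNextSibling()) {
                if (s.getNodeType() == Node.ELEMENT_NODE && s.getNodeName().equals(name)) {
                    sameNamed++;
                }
            }
            // Only add [n] when the tag name is ambiguous at this level.
            String step = "/" + name + (sameNamed > 1 ? "[" + index + "]" : "");
            path.insert(0, step);
            current = current.getParentNode();
        }
        return path.toString();
    }

    // Return the first element with the given tag whose attribute equals value, or null.
    public static Element findByAttribute(Document doc, String tag, String attr, String value) {
        NodeList nodes = doc.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            Element e = (Element) nodes.item(i);
            if (value.equals(e.getAttribute(attr))) {
                return e;
            }
        }
        return null;
    }

    // Parse a well-formed XML string; wraps checked exceptions for brevity.
    public static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><div><div><img src=\"a.jpg\"/><img src=\"xxx.jpg\"/></div></div></html>";
        Element img = findByAttribute(parse(page), "img", "src", "xxx.jpg");
        System.out.println(xpathOf(img)); // prints /html/div/div/img[2]
    }
}
```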

How does webmagic get the URL after a redirect?

9

For example, request http://baike.baidu.com/subview/38681/5279942.htm and then obtain the redirected URL http://baike.baidu.com/item/%E9%82%93%E8%B6%85/5681

tonglin0325

enhancement

Update content

2

Update content

apaqi

Why does HttpClientDownloader use a Map to manage httpClients?

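Reading the HttpClientDownloader source raised a question. httpClients is keyed by the domain in site, but the domain in site is never updated automatically, and even updating it manually at runtime does not seem to guarantee that a URL gets the HttpClient for its own domain. Is the Map there so that multiple Spider instances can share a single HttpClientDownloader instance? That way, different Spider instances would configure different domains in their site and get different HttpClients.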

KeShBo

Bug and Vulnerability fixes

1

Xenios91

Scraping an ajax-rendered page fails with an error when using jsonPath. Looking for a fix!!!

6

Exception in thread "pool-1-thread-1" java.lang.NoSuchMethodError: com.jayway.jsonpath.JsonPath.compile(Ljava/lang/String;[Lcom/jayway/jsonpath/Filter;)Lcom/jayway/jsonpath/JsonPath;

dpjs

bug

toTest
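A `NoSuchMethodError` like the one above generally means the class found at runtime does not have the method signature the calling code was compiled against. One plausible cause, stated here as an assumption, is that a different json-path version is on the classpath than the one webmagic expects. The JDK-only sketch below prints where a class was actually loaded from, which can help confirm or rule that out; `WhichJar` and `locationOf` are hypothetical names.

```java
import java.security.CodeSource;

public class WhichJar {

    // Returns where a class was loaded from, or a note if that cannot be determined.
    public static String locationOf(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class<?> type = Class.forName(className, false, WhichJar.class.getClassLoader());
            CodeSource source = type.getProtectionDomain().getCodeSource();
            return source == null ? "(bootstrap/JDK class)" : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Class name taken from the stack trace in the report above.
        System.out.println(locationOf("com.jayway.jsonpath.JsonPath"));
    }
}
```

If the printed jar is not the version expected, excluding the conflicting transitive dependency in the build file is the usual next step.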

Improvement: change the Page.addTargetRequests method

1

Change the parameter List requests to Collection requests???

ningpp
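The practical effect of the proposed signature change can be sketched as follows: a method that takes a `Collection` accepts a `List`, a `Set`, or any other collection without the caller copying it first. `TargetRequests` below is a hypothetical stand-in for `Page`, and the element type `String` is an assumption, since the report does not show the generic parameter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for Page; not webmagic code.
public class TargetRequests {
    private final List<String> targets = new ArrayList<>();

    // Accepting Collection lets callers pass a List, a Set, or any other collection.
    public void addTargetRequests(Collection<String> requests) {
        for (String url : requests) {
            // Skip null and blank entries.
            if (url != null && !url.trim().isEmpty()) {
                targets.add(url);
            }
        }
    }

    public List<String> targets() {
        return targets;
    }

    public static void main(String[] args) {
        TargetRequests page = new TargetRequests();
        page.addTargetRequests(Arrays.asList("a", "b"));          // a List still works
        Set<String> deduped = new LinkedHashSet<>(Arrays.asList("c", "c", "d"));
        page.addTargetRequests(deduped);                          // a Set now works too
        System.out.println(page.targets());                       // prints [a, b, c, d]
    }
}
```

Widening the parameter type is source-compatible for existing callers that pass a `List`, though code compiled against the old signature would need recompiling.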

How to catch the read timeout exception during download

3

java.net.SocketTimeoutException: Read timed out
try {
    // configure the spider
    Spider detailSpider = Spider.create(new BaiDuExtendProcessor())
            .addPipeline(new BaiDuExtendPipeline())
            .addRequest(request);
    detailSpider.start();
} catch (Exception e) {
    // log the error when the spider fails
    errorLogRepository.save(LogUtil.errorLog(webContentEntities, e.toString()));
}
}
This try/catch does not catch the exception.

yanganzhen
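A likely reason the try/catch above never fires is that `start()` returns immediately and the download runs on another thread, so nothing thrown there can reach the caller's catch block. The JDK-only sketch below illustrates that general behaviour and the usual workaround of passing in a callback. `AsyncErrorDemo` and `ErrorListener` are hypothetical names. webmagic ships, as far as I know, a `SpiderListener` interface with an `onError` callback that may serve this purpose; its exact signature depends on the version in use.

```java
import java.net.SocketTimeoutException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AsyncErrorDemo {

    // Hypothetical callback type for reporting failures from the worker thread.
    public interface ErrorListener {
        void onError(String url, Exception cause);
    }

    // Starts a simulated download on a worker thread and returns immediately.
    public static ExecutorService startAsync(String url, ErrorListener listener) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        pool.execute(() -> {
            try {
                throw new SocketTimeoutException("Read timed out"); // simulated failure
            } catch (Exception e) {
                listener.onError(url, e); // report through the callback
            }
        });
        pool.shutdown(); // no more tasks; the submitted one still runs
        return pool;
    }

    // Runs the demo and returns what the listener and the caller each observed.
    public static List<String> runDemo() {
        List<String> observed = new CopyOnWriteArrayList<>();
        boolean caughtByCaller = false;
        try {
            ExecutorService pool = startAsync("http://example.com",
                    (url, cause) -> observed.add(url + ": " + cause.getMessage()));
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            caughtByCaller = true; // not reached by the worker's exception
        }
        observed.add("caughtByCaller=" + caughtByCaller);
        return observed;
    }

    public static void main(String[] args) {
        for (String line : runDemo()) {
            System.out.println(line);
        }
    }
}
```

The listener receives the timeout, while the surrounding catch block stays untouched, which matches the behaviour described in the report.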
