webmagic
A scalable web crawler framework for Java.
Hi, this is a question, not a bug report. [url-frontier](https://github.com/crawler-commons/url-frontier) is an API to define a [crawl frontier](https://en.wikipedia.org/wiki/Crawl_frontier). It uses gRPC and has a service implementation. It is crawler-neutral and...
``` java
public void execute(final Runnable runnable) {
    if (threadAlive.get() >= threadNum) {
        try {
            reentrantLock.lock();
            while (threadAlive.get() >= threadNum) {
                try {
                    condition.await();
                } catch (InterruptedException e) {
                }...
```
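For context, the truncated snippet above follows the bounded-concurrency pattern used by webmagic's `CountableThreadPool`: the submitting thread blocks while the number of live tasks has reached `threadNum`, and each finished task signals the condition so a waiting submitter can proceed. Below is a self-contained sketch of that pattern; the field names mirror the snippet, but the rest is an assumption, not the actual webmagic source.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Sketch of a bounded executor: at most `threadNum` tasks run at once,
// extra submissions wait on a Condition until a running task finishes.
public class BoundedExecutor {

    private final int threadNum;
    private final AtomicInteger threadAlive = new AtomicInteger();
    private final ReentrantLock reentrantLock = new ReentrantLock();
    private final Condition condition = reentrantLock.newCondition();
    private final ExecutorService executorService = Executors.newCachedThreadPool();

    public BoundedExecutor(int threadNum) {
        this.threadNum = threadNum;
    }

    public void execute(final Runnable runnable) {
        if (threadAlive.get() >= threadNum) {
            reentrantLock.lock();
            try {
                while (threadAlive.get() >= threadNum) {
                    try {
                        condition.await();          // wait until a running task signals completion
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;                     // give up on submitting if interrupted
                    }
                }
            } finally {
                reentrantLock.unlock();             // always release the lock
            }
        }
        threadAlive.incrementAndGet();
        executorService.execute(() -> {
            try {
                runnable.run();
            } finally {
                reentrantLock.lock();
                try {
                    threadAlive.decrementAndGet();
                    condition.signal();             // wake one waiting submitter
                } finally {
                    reentrantLock.unlock();
                }
            }
        });
    }
}
```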
@sutra With 0.7.5 there was no such problem; it only appeared after upgrading the version.
I found that a proxy can be set on the Site object, and a proxy can also be set on the downloader:

```java
HttpClientDownloader downloader = new HttpClientDownloader();
downloader.setProxyProvider(SimpleProxyProvider.from(
        new Proxy("127.0.0.1", 7890)
));
Spider.create(this)
        .addUrl(starturl)
        .addPipeline(new XiannvPipeline())
        .setDownloader(downloader)
        .runAsync();
```

My question is: are these two approaches equivalent, and why does the proxy need to be set this way?
With version 0.10.3, if the target is an .shtml page, the following error is reported:

```
process request Request{url='https://www.prlife.com.cn/page/message/base/product/list/product_help_list.shtml', method='null', extras=null, priority=0, headers={}, cookies={}} error
java.lang.NullPointerException: null
	at java.util.regex.Matcher.getTextLength(Matcher.java:1283)
	at java.util.regex.Matcher.reset(Matcher.java:309)
	at java.util.regex.Matcher.<init>(Matcher.java:229)
	at java.util.regex.Pattern.matcher(Pattern.java:1093)
	at us.codecraft.webmagic.utils.UrlUtils.getCharset(UrlUtils.java:119)
	at us.codecraft.webmagic.utils.CharsetUtils.detectCharset(CharsetUtils.java:28)
	at us.codecraft.webmagic.downloader.HttpClientDownloader.getHtmlCharset(HttpClientDownloader.java:128)
	at us.codecraft.webmagic.downloader.HttpClientDownloader.handleResponse(HttpClientDownloader.java:112)
	at...
```
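From the trace, the NullPointerException comes from `Pattern.matcher(null)` inside `UrlUtils.getCharset`, which suggests the Content-Type value handed to it is null for this page. A minimal, hypothetical defensive variant is sketched below; the regex, class name and method shape are assumptions for illustration, not the actual webmagic source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetGuardExample {

    // Illustrative pattern for extracting "charset=xxx"; the real UrlUtils may differ.
    private static final Pattern CHARSET_PATTERN =
            Pattern.compile("charset\\s*=\\s*['\"]?([^\\s;'\"]+)", Pattern.CASE_INSENSITIVE);

    public static String getCharset(String contentType) {
        if (contentType == null) {
            // Avoid Pattern.matcher(null), which throws the NPE seen in the trace.
            return null;
        }
        Matcher matcher = CHARSET_PATTERN.matcher(contentType);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(getCharset("text/html; charset=UTF-8")); // UTF-8
        System.out.println(getCharset(null));                       // null, no exception
    }
}
```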
Adjust the retry logic
Change how the status code is handled after a successful download: if the status code is not accepted by site.acceptStatCode, treat the download as a failure and run the doCycleRetry retry logic. Also sleep before adding the request back to the queue.
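An illustrative fragment of that behavior inside the spider's download-success handling is shown below; it is a sketch of the described change, not the actual patch, and assumes webmagic's usual `Site`/`Page` accessors.

```java
// Inside the spider, after the downloader reports success:
if (!site.getAcceptStatCode().contains(page.getStatusCode())) {
    // The status code is not accepted, so treat the download as a failure:
    // sleep first, then put the request back into the queue via the retry logic.
    sleep(site.getRetrySleepTime());
    doCycleRetry(request);
} else {
    pageProcessor.process(page);
    // ... normal target-request extraction and pipeline handling ...
}
```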
```
Exception in thread "main" java.lang.NoSuchFieldError: INSTANCE
	at org.apache.http.conn.ssl.SSLConnectionSocketFactory.<clinit>(SSLConnectionSocketFactory.java:146)
	at us.codecraft.webmagic.downloader.HttpClientGenerator.buildSSLConnectionSocketFactory(HttpClientGenerator.java:52)
	at us.codecraft.webmagic.downloader.HttpClientGenerator.<init>(HttpClientGenerator.java:44)
	at us.codecraft.webmagic.downloader.HttpClientDownloader.<init>(HttpClientDownloader.java:40)
	at us.codecraft.webmagic.Spider.initComponent(Spider.java:280)
	at us.codecraft.webmagic.Spider.run(Spider.java:305)
	at com.mufeng.spider.GithubRepoPageProcessor.main(GithubRepoPageProcessor.java:30)
```
How to trigger a retry
When parsing the data returned in process(), how can I trigger the retry mechanism on demand? Please advise.
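One common way to do this is to push the current request back to the scheduler from inside `process()` when the extracted data is not what you expect. The sketch below assumes the standard `Page`/`PageProcessor` API; the xpath, field name and class name are made up for illustration, and a de-duplicating scheduler may filter an identical URL, so you may need to mark the re-queued request to let it through.

```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class RetryOnBadParseProcessor implements PageProcessor {

    private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Hypothetical extraction; replace with your own xpath/field.
        String data = page.getHtml().xpath("//div[@id='data']/text()").get();
        if (data == null) {
            // Parsing did not yield what we need: re-queue the current request
            // so the page is downloaded and processed again.
            page.addTargetRequest(page.getRequest());
            return;
        }
        page.putField("data", data);
    }

    @Override
    public Site getSite() {
        return site;
    }
}
```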