webmagic

A scalable web crawler framework for Java.

Results 147 webmagic issues

Hi, This is a question, not a bug report. [url-frontier](https://github.com/crawler-commons/url-frontier) is an API to define a [crawl frontier](https://en.wikipedia.org/wiki/Crawl_frontier). It uses gRPC and has a service implementation. It is crawler-neutral and...

```java
public void execute(final Runnable runnable) {
    if (threadAlive.get() >= threadNum) {
        try {
            reentrantLock.lock();
            while (threadAlive.get() >= threadNum) {
                try {
                    condition.await();
                } catch (InterruptedException e) {
                }...
```
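The excerpt above is cut off, so the rest of the method is not reconstructed here. As a rough, self-contained illustration of the same idea (block the submitter on a `Condition` until the number of running tasks drops below a limit), the sketch below may help. The class `BoundedPoolSketch`, its fields, and the `demo` helper are hypothetical names and are not WebMagic code. Unlike the excerpt, it keeps the counter under the lock and does not swallow `InterruptedException`.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: limits how many submitted tasks run at once.
class BoundedPoolSketch {
    private final int limit;
    private int alive = 0; // guarded by lock
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition slotFreed = lock.newCondition();
    // Unbounded pool on purpose: the limit is enforced by the lock, not the pool.
    private final ExecutorService executor = Executors.newCachedThreadPool();

    BoundedPoolSketch(int limit) {
        this.limit = limit;
    }

    // Blocks the caller until a slot is free, then hands the task to the pool.
    void execute(Runnable task) {
        lock.lock();
        try {
            while (alive >= limit) {
                slotFreed.await();
            }
            alive++;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to callers
            throw new RejectedExecutionException("interrupted while waiting for a free slot", e);
        } finally {
            lock.unlock();
        }
        executor.execute(() -> {
            try {
                task.run();
            } finally {
                lock.lock();
                try {
                    alive--;
                    slotFreed.signal(); // wake one waiting submitter
                } finally {
                    lock.unlock();
                }
            }
        });
    }

    // Stops accepting tasks and waits for running ones; returns false on timeout or interrupt.
    boolean shutdownAndWait() {
        executor.shutdown();
        try {
            return executor.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Runs `tasks` short jobs and returns {max tasks observed running at once, tasks completed}.
    static int[] demo(int limit, int tasks) {
        BoundedPoolSketch pool = new BoundedPoolSketch(limit);
        AtomicInteger running = new AtomicInteger();
        AtomicInteger maxRunning = new AtomicInteger();
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                int now = running.incrementAndGet();
                maxRunning.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(5);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                running.decrementAndGet();
                completed.incrementAndGet();
            });
        }
        pool.shutdownAndWait();
        return new int[] { maxRunning.get(), completed.get() };
    }

    public static void main(String[] args) {
        int[] result = demo(3, 20);
        System.out.println("max running: " + result[0] + ", completed: " + result[1]);
    }
}
```

Doing the check and the increment inside the same locked region means the limit holds even when several threads submit tasks at the same time.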

@sutra With 0.7.5 this problem does not occur; it only appeared after upgrading. ![image](https://user-images.githubusercontent.com/45236518/198176773-1a668c90-2755-4e01-9ece-abec23141bb7.png) ![image](https://user-images.githubusercontent.com/45236518/198177008-41a74946-2007-4041-a081-74e38766209d.png) ![image](https://user-images.githubusercontent.com/45236518/198177035-f6cb0a40-b917-41d6-9cd3-431de8991ffe.png)

I found that a proxy can be set on the Site object ![image](https://user-images.githubusercontent.com/46296608/192174037-49aa0942-1d6d-453b-8b53-ca010e3f8669.png) and that one can also be set on the downloader:

```java
HttpClientDownloader downloader = new HttpClientDownloader();
downloader.setProxyProvider(SimpleProxyProvider.from(
        new Proxy("127.0.0.1", 7890)
));
Spider.create(this)
        .addUrl(starturl)
        .addPipeline(new XiannvPipeline())
        .setDownloader(downloader)
        .runAsync();
```

My question: are these two approaches equivalent, and why is it set up this way?

With version 0.10.3, if the target is an .shtml page, the following error is reported:
process request Request{url='https://www.prlife.com.cn/page/message/base/product/list/product_help_list.shtml', method='null', extras=null, priority=0, headers={}, cookies={}} error
java.lang.NullPointerException: null
at java.util.regex.Matcher.getTextLength(Matcher.java:1283)
at java.util.regex.Matcher.reset(Matcher.java:309)
at java.util.regex.Matcher.<init>(Matcher.java:229)
at java.util.regex.Pattern.matcher(Pattern.java:1093)
at us.codecraft.webmagic.utils.UrlUtils.getCharset(UrlUtils.java:119)
at us.codecraft.webmagic.utils.CharsetUtils.detectCharset(CharsetUtils.java:28)
at us.codecraft.webmagic.downloader.HttpClientDownloader.getHtmlCharset(HttpClientDownloader.java:128)
at us.codecraft.webmagic.downloader.HttpClientDownloader.handleResponse(HttpClientDownloader.java:112)
at...

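Changes how the status code is handled after a download completes: if the status code is not accepted by site.acceptStatCode, the request is treated as failed and goes through the doCycleRetry retry logic. The request also sleeps before being added back to the queue.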
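The behavior described above can be sketched without any WebMagic dependency. This is only an illustration under stated assumptions: `StatusRetrySketch`, `Attempt`, `acceptStatusCodes`, `cycleRetryTimes`, and `retrySleepMillis` are hypothetical names chosen for this example, and the fetch step is passed in as a function that returns an HTTP status code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToIntFunction;

// Hypothetical sketch: a response with a non-accepted status code counts as a failure
// and is re-queued (after a sleep) until the retry budget is used up.
class StatusRetrySketch {
    // A queued URL plus the number of retries already spent on it.
    static final class Attempt {
        final String url;
        final int retriesSoFar;

        Attempt(String url, int retriesSoFar) {
            this.url = url;
            this.retriesSoFar = retriesSoFar;
        }
    }

    private final Set<Integer> acceptStatusCodes;
    private final int cycleRetryTimes;
    private final long retrySleepMillis;
    private final Deque<Attempt> queue = new ArrayDeque<>();
    final List<String> succeeded = new ArrayList<>();
    final List<String> failed = new ArrayList<>();

    StatusRetrySketch(Set<Integer> acceptStatusCodes, int cycleRetryTimes, long retrySleepMillis) {
        this.acceptStatusCodes = acceptStatusCodes;
        this.cycleRetryTimes = cycleRetryTimes;
        this.retrySleepMillis = retrySleepMillis;
    }

    void add(String url) {
        queue.addLast(new Attempt(url, 0));
    }

    // Drains the queue and returns how many fetches were made in total.
    int run(ToIntFunction<String> fetchStatus) {
        int fetches = 0;
        while (!queue.isEmpty()) {
            Attempt attempt = queue.pollFirst();
            int status = fetchStatus.applyAsInt(attempt.url);
            fetches++;
            if (acceptStatusCodes.contains(status)) {
                succeeded.add(attempt.url);
            } else if (attempt.retriesSoFar >= cycleRetryTimes) {
                failed.add(attempt.url); // retry budget exhausted
            } else {
                sleepQuietly(retrySleepMillis); // back off before re-queuing
                queue.addLast(new Attempt(attempt.url, attempt.retriesSoFar + 1));
            }
        }
        return fetches;
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        StatusRetrySketch crawler =
                new StatusRetrySketch(new HashSet<>(Arrays.asList(200)), 2, 10L);
        crawler.add("https://example.com/always-503");
        // Stand-in fetcher that always answers 503, so the URL ends up in `failed`.
        int fetches = crawler.run(url -> 503);
        System.out.println("fetches=" + fetches + " failed=" + crawler.failed);
    }
}
```

Sleeping before the request goes back on the queue gives a struggling server some room between attempts instead of retrying immediately.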

`Exception in thread "main" java.lang.NoSuchFieldError: INSTANCE at org.apache.http.conn.ssl.SSLConnectionSocketFactory.<clinit>(SSLConnectionSocketFactory.java:146) at us.codecraft.webmagic.downloader.HttpClientGenerator.buildSSLConnectionSocketFactory(HttpClientGenerator.java:52) at us.codecraft.webmagic.downloader.HttpClientGenerator.<init>(HttpClientGenerator.java:44) at us.codecraft.webmagic.downloader.HttpClientDownloader.<init>(HttpClientDownloader.java:40) at us.codecraft.webmagic.Spider.initComponent(Spider.java:280) at us.codecraft.webmagic.Spider.run(Spider.java:305) at com.mufeng.spider.GithubRepoPageProcessor.main(GithubRepoPageProcessor.java:30)`

When parsing the returned data in Process, how can the retry mechanism be triggered on demand?