webmagic issues

Suggestion: let a Request specify which method runs after its download succeeds

1

I have added something on top of the current version as a suggestion: add a field to Request that specifies which method should parse the HTML once this request has finished downloading. The Request implementation: ` import us.codecraft.webmagic.Request; public class MyRequest extends Request { public String tartMethod; public MyRequest(String url) { super(url); } public MyRequest(String url, String tartMethod) { this(url); this.tartMethod = tartMethod;` ...

FuckerDeng
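The issue's own code is truncated, so the following is only a minimal sketch of the idea of routing a downloaded page by a name carried on the request. `RoutedRequest` and `HandlerRegistry` are hypothetical names invented for illustration; they do not extend WebMagic's `Request` and are not the author's implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-in for a request that names its own parser.
class RoutedRequest {
    final String url;
    final String handlerName; // plays the role of the issue's tartMethod field

    RoutedRequest(String url, String handlerName) {
        this.url = url;
        this.handlerName = handlerName;
    }
}

// Hypothetical registry that maps handler names to parsing functions.
class HandlerRegistry {
    private final Map<String, Function<String, String>> handlers = new HashMap<>();

    void register(String name, Function<String, String> handler) {
        handlers.put(name, handler);
    }

    // Looks up the handler named by the request and applies it to the downloaded HTML.
    String dispatch(RoutedRequest request, String html) {
        Function<String, String> handler = handlers.get(request.handlerName);
        if (handler == null) {
            throw new IllegalArgumentException("No handler registered for: " + request.handlerName);
        }
        return handler.apply(html);
    }
}
```

A lookup table is used here instead of reflection so that a misspelled handler name fails with a clear message rather than a reflective error.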

Some suggestions about the code

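If a spider is created and expected to run long-term, pageCount needs to be reset to 0, but the original code declares `private final AtomicLong pageCount = new AtomicLong(0);`. Has resetting pageCount to 0 in `initComponent` been considered? Also, if `executorService` is passed in from outside, it is forcibly shut down when the spider finishes, which is dangerous. `pipelines` is not thread-safe; if it is modified while the spider is running, could that cause strange exceptions? Has `CopyOnWriteArrayList` been considered? And could `HttpClientGenerator` in `HttpClientDownloader` be made easier to configure? httpClient is an important object, and configuring it currently takes considerable effort because the class has to be rewritten.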

yaoqiangpersonal
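Three of the points raised above (resetting the counter, not shutting down a caller-supplied executor, and a thread-safe pipeline list) can be sketched with the standard library alone. `CrawlerSketch` is a hypothetical class written for illustration; it is not WebMagic's `Spider`, and the ownership flag is an assumption about how such a fix might look.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

// Hypothetical stand-in for a long-running crawler; not WebMagic's Spider.
class CrawlerSketch {
    private final AtomicLong pageCount = new AtomicLong(0);
    // Safe to modify from one thread while another thread iterates over it.
    private final List<Consumer<String>> pipelines = new CopyOnWriteArrayList<>();
    private final ExecutorService executor;
    // True only when this object created the executor itself.
    private final boolean ownsExecutor;

    // Pass null to let the sketch create (and later shut down) its own pool.
    CrawlerSketch(ExecutorService external) {
        this.ownsExecutor = (external == null);
        this.executor = ownsExecutor ? Executors.newFixedThreadPool(2) : external;
    }

    void addPipeline(Consumer<String> pipeline) {
        pipelines.add(pipeline);
    }

    // Called at the start of each run so a reused instance counts from zero again.
    void initComponent() {
        pageCount.set(0);
    }

    void process(String page) {
        pageCount.incrementAndGet();
        for (Consumer<String> pipeline : pipelines) {
            pipeline.accept(page);
        }
    }

    long getPageCount() {
        return pageCount.get();
    }

    // Shuts down only an executor this object created; a caller-supplied one is left running.
    void close() {
        if (ownsExecutor) {
            executor.shutdown();
        }
    }
}
```

Note that a `final AtomicLong` can still be reset with `set(0)`; `final` fixes the reference, not the value it holds.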

Dependency conflicts on org.ow2.asm:asm, leading to inconsistent program behaviors

1

Hi, in **webmagic-WebMagic-0.7.3/webmagic-scripts**, there are multiple versions of library **org.ow2.asm:asm**. However, according to Maven's dependency management strategy: **_"first declaration wins"_**, only **org.ow2.asm:asm:4.0** can be loaded, and **org.ow2.asm:asm:5.0.4** will be shadowed....

HelloCoCooo

[Suggestion] An optimization for the thread invocation in spider.run()

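In the else branch of the while loop in spider.run(), calling spider.close() during a multithreaded crawl throws java.util.concurrent.RejectedExecutionException; the check in the while condition occasionally fails to prevent this. I suggest wrapping the call in try...catch to improve the experience. Nothing is actually broken; I am just a bit obsessive and dislike seeing exceptions.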

7ye
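The race described above (a shutdown check passes, then another thread closes the pool before the task is submitted) can be handled by catching the rejection at the submit site. This is a minimal sketch using only `java.util.concurrent`; `SafeSubmit` and `trySubmit` are hypothetical names, not part of WebMagic.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.RejectedExecutionException;

// Hypothetical helper; not part of WebMagic.
class SafeSubmit {
    // Returns true if the task was accepted, false if the pool had been shut down.
    static boolean trySubmit(ExecutorService pool, Runnable task) {
        if (pool.isShutdown()) {
            return false; // fast path; another thread can still shut down after this check
        }
        try {
            pool.execute(task);
            return true;
        } catch (RejectedExecutionException e) {
            // Swallow the exception only when shutdown is the cause;
            // other rejections (for example a saturated queue) are rethrown.
            if (pool.isShutdown()) {
                return false;
            }
            throw e;
        }
    }
}
```

Returning a boolean lets the calling loop exit quietly when the pool is closed, instead of printing a stack trace.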

enhancement

I think webmagic should be changed so that it can integrate with Spring

1

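I ran your demo that crawls GitHub and got the error Received fatal alert: protocol_version. After reading your solution, I found it less than ideal. It requires rewriting two classes, HttpClientDownloader and HttpClientGenerator, and on inspection they are very unfriendly: they cannot be auto-injected or auto-configured through an IoC container. For example, the httpClientGenerator field in HttpClientDownloader is private and has no setter or getter. You should provide a getter and a setter and adapt the other classes accordingly, so that IoC containers such as Spring can configure them automatically.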

s-wangc

Class file for org.apache.http.annotation.ThreadSafe

1

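org.apache.httpcomponents httpcore 4.4.4. The httpcore used in my project is version 4.4.5, in which the ThreadSafe annotation appears to have been dropped. Please publish an updated version to Maven. For now my local project has to exclude your httpclient and then lower the httpcore version: us.codecraft webmagic-core 0.7.3 (excluding org.slf4j slf4j-log4j12 and org.apache.httpcomponents httpclient), org.apache.httpcomponents httpclient 4.5.2, org.apache.httpcomponents httpcore 4.4.4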

wanjinji

Requesting an invitation to the group, thanks

9

My QQ is 1991997728, many thanks~

CrowsStriker

An alternative to Selenium-driven Chrome: Jvppeteer, the Java port of Puppeteer

You can use Jvppeteer (the Java port of Puppeteer), available at https://github.com/fanyong920/jvppeteer/, as a replacement. In my own use so far, it seems to consume fewer resources than Selenium-driven Chrome.

kzjxm

xpath not contains support

1

The not contains syntax is not supported in XPath here, and I don't know of a workaround. The xsoup 0.3.1 source does not support it; could support be added, or is there another way? e.g. xpath "//*[@class='mod-play-list']/li[not(contains(@class,'item-hold')]/a" org.jsoup.select.Selector$SelectorParseException: Could not parse query 'li[not(contains(@class,'item-hold')]': unexpected token at 'not(contains(@class,'item-hold')' at us.codecraft.xsoup.xevaluator.XPathParser.byFunction(XPathParser.java:260) at us.codecraft.xsoup.xevaluator.XPathParser.consumePredicates(XPathParser.java:231) at us.codecraft.xsoup.xevaluator.XPathParser.findElements(XPathParser.java:163) at us.codecraft.xsoup.xevaluator.XPathParser.parse(XPathParser.java:76)...

freshgeek

Fetched HTML is incomplete

2

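I recently noticed that some sites' pages are fetched incompletely, with only part of the content returned, and I don't know why. Also, why does httpResponse.getEntity().getContentLength() always return -1?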

yjia0
