xxl-crawler
A distributed web crawler framework (XXL-CRAWLER).
Bumps [jsoup](https://github.com/jhy/jsoup) from 1.11.2 to 1.14.2. Release notes, sourced from jsoup's releases: jsoup 1.14.2 ("Caught by the fuzz!") is out now, and includes a set of parser bug...
As stated in the title.
In the `JsoupUtil` utility class, the `loadPageSource()` method never calls `requestBody()` on the `Connection`. Some APIs only accept parameters passed via `Connection.requestBody()`, and in that case no data can be fetched.
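The jsoup `Connection` API does support this via `requestBody()`. Below is a minimal sketch of the kind of call the reporter is asking `loadPageSource()` to make, assuming a JSON API; the class name, URL, and body are placeholders, not part of xxl-crawler:

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class RequestBodyLoad {
    // POST with parameters in the raw request body instead of form fields.
    public static Document loadWithBody(String url, String jsonBody) throws IOException {
        return Jsoup.connect(url)
                .header("Content-Type", "application/json")
                .requestBody(jsonBody)       // parameters passed via the request body
                .ignoreContentType(true)     // accept non-HTML responses such as JSON
                .post();
    }
}
```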
Bumps [junit](https://github.com/junit-team/junit4) from 4.11 to 4.13.1. Release notes, sourced from junit's releases: JUnit 4.13.1 (please refer to the release notes for details); JUnit 4.13 (please refer to the release notes...)
Currently `JsoupUtil.findLinks(html)` only collects URLs that start with `http`, and this behavior cannot be customized. A method could be added to `RunData` so that users can supply their own implementation of `findUrls()`; see the sketch below. For example, bilibili video URLs have no `http` prefix: ``
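A minimal sketch of what such an extension might look like, using jsoup's `abs:` attribute resolution to also pick up protocol-relative links (`//host/path`) such as the bilibili URLs mentioned above; `LinkFinder` and its `findLinks` signature are hypothetical, not part of xxl-crawler:

```java
import java.util.LinkedHashSet;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkFinder {
    // Collect absolute links; abs:href resolves relative and
    // protocol-relative ("//...") hrefs against the supplied baseUri.
    public static Set<String> findLinks(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Set<String> links = new LinkedHashSet<>();
        for (Element a : doc.select("a[href]")) {
            String href = a.attr("abs:href");
            if (!href.isEmpty()) {
                links.add(href);
            }
        }
        return links;
    }
}
```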
Bumps [htmlunit](https://github.com/HtmlUnit/htmlunit) from 2.24 to 2.37.0. Release notes, sourced from htmlunit's releases: HtmlUnit-2.37.0 (bugfixes; many js improvements done in Rhino; CHROME 79; FF52 removed; FF68 added); HtmlUnit-2.36.0 (bugfixes; many js...)
After using `SeleniumPhantomjsPageLoader`, the `baseUri` of the `Document` object produced by jsoup parsing is empty.
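One possible workaround at the jsoup level, as a minimal sketch: assuming the loader's page source and the original request URL are both available, passing the URL as `baseUri` restores `doc.baseUri()` and `abs:` attribute resolution. The class name is hypothetical:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BaseUriWorkaround {
    // Supply the request URL as baseUri so the parsed Document is not rootless.
    public static Document parse(String pageSource, String requestUrl) {
        return Jsoup.parse(pageSource, requestUrl); // baseUri no longer empty
    }
}
```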
How can a connect timeout for a given URL be detected and handled, or the content re-queued for crawling?
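A minimal sketch of one way to handle this at the fetch level, assuming jsoup is used for loading; on timeout the caller can decide to re-queue the URL (for instance via `RunData.addUrl`, if that fits the crawler's flow). The class and method names are hypothetical:

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TimeoutHandling {
    // Returns null on timeout or I/O failure so the caller can decide
    // whether to retry immediately or put the URL back on the queue.
    public static Document loadOrNull(String url, int timeoutMillis) {
        try {
            return Jsoup.connect(url).timeout(timeoutMillis).get();
        } catch (SocketTimeoutException e) {
            return null; // connect/read timed out: candidate for re-queue
        } catch (IOException e) {
            return null; // other network failure
        }
    }
}
```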
In `com.xuxueli.crawler.thread.CrawlerThread#processPage`, shouldn't the following code return `false` instead?

```java
if (!crawler.getRunConf().validWhiteUrl(pageRequest.getUrl())) {
    // limit unvalid-page parse, only allow spread child, finish here
    return true;
}
```
Is there support for crawling content after logging in?
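A minimal sketch of one common approach, assuming a form-based login and using plain jsoup rather than any xxl-crawler API; the class name, login URL, and field names ("username", "password") are placeholders:

```java
import java.io.IOException;
import java.util.Map;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LoginThenCrawl {
    // Log in once, capture the session cookies, then fetch pages with them.
    public static Document fetchAfterLogin(String loginUrl, String targetUrl,
                                           String user, String pass) throws IOException {
        Connection.Response login = Jsoup.connect(loginUrl)
                .data("username", user)
                .data("password", pass)
                .method(Connection.Method.POST)
                .execute();
        Map<String, String> cookies = login.cookies(); // session cookies from login
        return Jsoup.connect(targetUrl).cookies(cookies).get(); // authenticated fetch
    }
}
```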