elasticsearch-river-web
Support protocol-relative URLs (URLs that inherit the scheme from the parent context)
Protocol-relative URLs do not seem to be supported. They are quite common in today's HTML sources, and all browsers appear to support them. When the crawler meets one, it fails to create the child URLs, as the log further down shows.
reference: http://www.paulirish.com/2010/the-protocol-relative-url/
[2015-04-05 16:29:27,270][INFO ][org.codelibs.robot.helper.impl.LogHelperImpl] Crawling URL: https://www.4chan.org/frames
[2015-04-05 16:29:27,869][WARN ][org.codelibs.robot.transformer.impl.HtmlTransformer] Could not create child urls.
java.net.MalformedURLException: no protocol: //www.4chan.org
at java.net.URL.<init>(URL.java:586)
at java.net.URL.<init>(URL.java:483)
at java.net.URL.<init>(URL.java:432)
at org.codelibs.robot.transformer.impl.HtmlTransformer.storeChildUrls(HtmlTransformer.java:242)
at org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer.storeChildUrls(ScrapingTransformer.java:706)
at org.codelibs.robot.transformer.impl.HtmlTransformer.transform(HtmlTransformer.java:169)
at org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer.transform(ScrapingTransformer.java:108)
at org.codelibs.robot.processor.impl.DefaultResponseProcessor.process(DefaultResponseProcessor.java:75)
at org.codelibs.robot.S2RobotThread.processResponse(S2RobotThread.java:401)
at org.codelibs.robot.S2RobotThread.run(S2RobotThread.java:190)
at java.lang.Thread.run(Thread.java:745)
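
For illustration, here is a minimal sketch of how a protocol-relative link could be resolved against the URL of the page it was found on, using only the standard `java.net.URI` class. The class name `ProtocolRelativeUrlSketch` and the method `resolve` are hypothetical and are not part of S2Robot or this plugin; where such a step would fit inside `HtmlTransformer.storeChildUrls` is an assumption, not something the trace confirms.

```java
import java.net.URI;

// Hypothetical helper, not part of org.codelibs.robot.
public class ProtocolRelativeUrlSketch {

    // Resolves href against the URL of the page that contains it.
    // A href such as "//host/path" has an authority but no scheme,
    // so URI.resolve takes the scheme from the page URL.
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Values taken from the log above.
        System.out.println(resolve("https://www.4chan.org/frames", "//www.4chan.org"));
        // prints "https://www.4chan.org"
    }
}
```

Passing the page URL as context, rather than calling `new URL(String)` on the raw `href`, is what avoids the "no protocol" error here: the single-argument constructor has no parent context to inherit a scheme from.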