elasticsearch-river-web

Support protocol-relative URLs (URLs that inherit the scheme from the parent context)

Open ducktype opened this issue 10 years ago • 0 comments

It seems that protocol-relative URLs are not supported. They are quite common in today's HTML sources, and all browsers appear to support them.

reference: http://www.paulirish.com/2010/the-protocol-relative-url/

[2015-04-05 16:29:27,270][INFO ][org.codelibs.robot.helper.impl.LogHelperImpl] Crawling URL: https://www.4chan.org/frames
[2015-04-05 16:29:27,869][WARN ][org.codelibs.robot.transformer.impl.HtmlTransformer] Could not create child urls.
java.net.MalformedURLException: no protocol: //www.4chan.org
        at java.net.URL.<init>(URL.java:586)
        at java.net.URL.<init>(URL.java:483)
        at java.net.URL.<init>(URL.java:432)
        at org.codelibs.robot.transformer.impl.HtmlTransformer.storeChildUrls(HtmlTransformer.java:242)
        at org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer.storeChildUrls(ScrapingTransformer.java:706)
        at org.codelibs.robot.transformer.impl.HtmlTransformer.transform(HtmlTransformer.java:169)
        at org.codelibs.elasticsearch.web.robot.transformer.ScrapingTransformer.transform(ScrapingTransformer.java:108)
        at org.codelibs.robot.processor.impl.DefaultResponseProcessor.process(DefaultResponseProcessor.java:75)
        at org.codelibs.robot.S2RobotThread.processResponse(S2RobotThread.java:401)
        at org.codelibs.robot.S2RobotThread.run(S2RobotThread.java:190)
        at java.lang.Thread.run(Thread.java:745)
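
For illustration, here is a minimal sketch of how a protocol-relative link could be resolved against the URL of the page it was found on, so that it inherits that page's scheme. It uses only `java.net.URI` from the JDK. The class and method names (`ProtocolRelativeUrls`, `resolve`) are hypothetical and are not part of the crawler's code; where such a step would belong in `HtmlTransformer` is an assumption left open here.

```java
import java.net.URI;

// Hypothetical helper, not part of org.codelibs.robot.
class ProtocolRelativeUrls {

    // Resolves a link found in a page against that page's URL.
    // A link such as "//host/path" has no scheme of its own, so
    // URI.resolve fills in the scheme of the base URL.
    static String resolve(String pageUrl, String href) {
        URI base = URI.create(pageUrl);
        return base.resolve(href).toString();
    }

    public static void main(String[] args) {
        // The case from the log above: prints https://www.4chan.org
        System.out.println(resolve("https://www.4chan.org/frames", "//www.4chan.org"));
        // The scheme follows the base page: prints http://cdn.example.com/x.js
        System.out.println(resolve("http://example.com/a/b", "//cdn.example.com/x.js"));
    }
}
```

Links that already carry a scheme are returned unchanged by `URI.resolve`, so the same call could in principle be applied to every extracted link.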

ducktype, Apr 05 '15 16:04