jsoup
Support `file:/...` URLs for local testing
I'm writing unit tests for an application that uses JSoup, and I would like to write the unit tests in terms of local files mirroring a website rather than the website itself.
As a workaround, I could write my tests in terms of Files rather than URLs, but that would require substantial refactoring. Could we please add support for protocols like file:/...?
Trace:
URL: file:/Users/user/Desktop/src/jsoupcrawler/target/test-classes/google/index.html
java.net.MalformedURLException: Only http & https protocols supported
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:417)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:410)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:164)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:153)
at orion.core.data.JSoupCrawler.recursiveCrawl(JSoupCrawler.java:66)
at orion.core.data.JSoupCrawler.recursiveCrawl(JSoupCrawler.java:19)
at orion.core.data.JSoupCrawlerTest.testRecursiveCrawl(JSoupCrawlerTest.java:53)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:53)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:123)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:104)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:164)
at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:110)
at org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:175)
at org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcessWhenForked(SurefireStarter.java:107)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:68)
Ah, could we make this line more general:
https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/Jsoup.java#L72
Instead of necessarily grabbing an HTTP connection, could we grab a connection based on the URL's stated protocol, e.g. file:/..., ftp://..., etc.?
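For illustration, here is a minimal sketch of what "a connection based on the URL's stated protocol" could look like using only the JDK: `java.net.URL.openStream()` already dispatches on the scheme, so the same call can read `http:`, `https:`, and `file:` URLs. The class name `LocalUrlFetcher` and the method `fetch` are hypothetical and not part of jsoup. The resulting text (or the stream itself) could presumably be handed to one of the existing `Jsoup.parse` overloads, though that wiring is not shown here.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of jsoup: reads any URL the JDK has a handler for.
final class LocalUrlFetcher {

    /** Opens the URL with whatever protocol handler the JDK selects and reads it as UTF-8. */
    static String fetch(URL url) throws IOException {
        try (InputStream in = url.openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Create a throwaway local "page" standing in for a mirrored website file.
        Path page = Files.createTempFile("index", ".html");
        Files.write(page, "<html><title>Local fixture</title></html>"
                .getBytes(StandardCharsets.UTF_8));

        URL url = page.toUri().toURL();          // a file: URL
        System.out.println(url.getProtocol());
        System.out.println(fetch(url));

        Files.delete(page);
    }
}
```

This sketch assumes UTF-8 content and does no protocol filtering at all, so it would read any local path it is given.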
I'm mostly concerned with the potential security ramifications here, i.e. you could get tricked into reading file:///etc/passwd. I think it'd need to be an option to enable in Jsoup.connect, potentially with a callback to confirm it is OK to load. Generally I like the idea. I don't think ftp:// is likely, though.
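To make the opt-in idea concrete, here is a rough sketch of a protocol policy with a confirmation callback. Everything in it is an assumption for discussion: `ProtocolPolicy`, `httpOnly`, `allowingFiles`, and `underRoot` are hypothetical names, not jsoup API, and the `/tmp/fixtures/` root is just an example path. By default only http and https are permitted; `file:` is accepted only when explicitly enabled and the callback approves the specific URI.

```java
import java.net.URI;
import java.util.Locale;
import java.util.function.Predicate;

// Hypothetical opt-in policy, not part of jsoup.
final class ProtocolPolicy {
    private final boolean allowFile;
    private final Predicate<URI> fileConfirm;

    private ProtocolPolicy(boolean allowFile, Predicate<URI> fileConfirm) {
        this.allowFile = allowFile;
        this.fileConfirm = fileConfirm;
    }

    /** Default behaviour: only http and https are permitted. */
    static ProtocolPolicy httpOnly() {
        return new ProtocolPolicy(false, uri -> false);
    }

    /** Opt-in: file: URIs are permitted only if the callback confirms each one. */
    static ProtocolPolicy allowingFiles(Predicate<URI> confirm) {
        return new ProtocolPolicy(true, confirm);
    }

    /** Example callback: accept only paths under the given root, after normalizing ".." segments. */
    static Predicate<URI> underRoot(String root) {
        return uri -> {
            String path = uri.normalize().getPath();
            return path != null && path.startsWith(root);
        };
    }

    boolean permits(URI uri) {
        String scheme = uri.getScheme();
        if (scheme == null) {
            return false;
        }
        switch (scheme.toLowerCase(Locale.ROOT)) {
            case "http":
            case "https":
                return true;
            case "file":
                return allowFile && fileConfirm.test(uri);
            default:
                return false; // ftp and anything else stays rejected
        }
    }

    public static void main(String[] args) {
        ProtocolPolicy policy = allowingFiles(underRoot("/tmp/fixtures/"));
        System.out.println(policy.permits(URI.create("file:///tmp/fixtures/index.html")));
        System.out.println(policy.permits(URI.create("file:///etc/passwd")));
    }
}
```

A string prefix check like this is only a starting point. It does not resolve symlinks, so a real implementation would likely want to compare canonical filesystem paths as well.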
> I think it'd need to be an option to enable in Jsoup.connect
Sounds good.
Is anyone working on this, @jhy? I'm thinking of giving it a try.
@jjpatel361 it's all yours! :)
@jhy PR #921 submitted for initial review. Please provide your feedback on this one.
@jhy Any feedback on this PR?
This is a REAL BAD IDEA from a security POV and I suggest you do NOT do this at all.