
Support `file:/...` URLs for local testing

Open apennebaker opened this issue 11 years ago • 8 comments

I'm writing unit tests for an application that uses JSoup, and I would like to write the unit tests in terms of local files mirroring a website rather than the website itself.

As a workaround, I could write my tests in terms of Files rather than URLs, but that would require substantial refactoring. Could we please add support for protocols like `file:/...`?

Trace:

URL: file:/Users/user/Desktop/src/jsoupcrawler/target/test-classes/google/index.html
java.net.MalformedURLException: Only http & https protocols supported
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:417)
    at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:410)
    at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:164)
    at org.jsoup.helper.HttpConnection.get(HttpConnection.java:153)
    at orion.core.data.JSoupCrawler.recursiveCrawl(JSoupCrawler.java:66)
    at orion.core.data.JSoupCrawler.recursiveCrawl(JSoupCrawler.java:19)
    at orion.core.data.JSoupCrawlerTest.testRecursiveCrawl(JSoupCrawlerTest.java:53)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
    at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:53)
    at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:123)
    at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:104)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:164)
    at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:110)
    at org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:175)
    at org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcessWhenForked(SurefireStarter.java:107)
    at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:68)

apennebaker avatar Sep 23 '13 13:09 apennebaker

Ah, could we make this line more general:

https://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/Jsoup.java#L72

Instead of necessarily grabbing an HTTP connection, could we grab a connection based on the URL's stated protocol, e.g. file:/..., ftp://..., etc.?

apennebaker avatar Sep 23 '13 14:09 apennebaker
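
To illustrate the request above, here is a minimal sketch of choosing the connection from the URL's stated protocol, using only the JDK. `ProtocolAwareLoader`, its `load` method, and the allowed-scheme list are hypothetical names for illustration, not jsoup API; jsoup itself rejects non-HTTP URLs, as the trace shows.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of jsoup): loads raw HTML from a URL with an allowed scheme. */
final class ProtocolAwareLoader {
    // Schemes accepted by this sketch; the list is an assumption for illustration.
    private static final Set<String> ALLOWED = Set.of("http", "https", "file");

    private ProtocolAwareLoader() {}

    static String load(URL url) throws IOException {
        String protocol = url.getProtocol().toLowerCase(Locale.ROOT);
        if (!ALLOWED.contains(protocol)) {
            throw new MalformedURLException("Unsupported protocol: " + protocol);
        }
        // URL.openConnection() selects the JDK's built-in handler for the scheme,
        // so the same code path serves http:, https: and file: URLs.
        URLConnection connection = url.openConnection();
        try (InputStream in = connection.getInputStream()) {
            // Assumes UTF-8 content; a real loader would detect the charset.
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

The returned string could then be handed to `Jsoup.parse(String html, String baseUri)`, which keeps the test fixture on disk while leaving the parsing code unchanged.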

I'm mostly concerned with the potential security ramifications here, i.e. you could get tricked into reading file:///etc/passwd. I think it'd need to be an option to enable in Jsoup.connect, potentially with a callback to confirm it's OK to load. Generally I like the idea. I don't think ftp:// is likely though.

jhy avatar Nov 18 '13 03:11 jhy
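
The opt-in idea in the comment above could look roughly like the following sketch. `FileAccessPolicy` and `permits` are hypothetical names, not an existing jsoup option; the assumption is that `file:` access stays off by default and, when enabled, each URI must still pass a caller-supplied confirmation callback.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Objects;
import java.util.function.Predicate;

/** Hypothetical policy (not jsoup API): file: URIs load only if enabled and confirmed. */
final class FileAccessPolicy {
    private final boolean fileUrlsEnabled;
    private final Predicate<URI> confirm;

    FileAccessPolicy(boolean fileUrlsEnabled, Predicate<URI> confirm) {
        this.fileUrlsEnabled = fileUrlsEnabled;
        this.confirm = Objects.requireNonNull(confirm);
    }

    boolean permits(URI uri) {
        // Normalize first so "../" segments cannot slip past a prefix check in the callback.
        URI normalized = uri.normalize();
        String scheme = normalized.getScheme();
        if (scheme == null) {
            return false; // relative URIs carry no protocol to decide on
        }
        switch (scheme.toLowerCase(Locale.ROOT)) {
            case "http":
            case "https":
                return true;
            case "file":
                // Both the explicit opt-in and the per-URI confirmation are required.
                return fileUrlsEnabled && confirm.test(normalized);
            default:
                return false; // e.g. ftp: stays unsupported
        }
    }
}
```

A test suite might construct it as `new FileAccessPolicy(true, u -> u.getPath().startsWith("/tmp/fixtures/"))`, where the fixture directory is an assumed example path. Normalizing the URI handles `..` segments but not symlinks, so this sketch narrows the `file:///etc/passwd` risk without eliminating it.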

> I think it'd need to be an option to enable in Jsoup.connect

Sounds good.

apennebaker avatar Nov 18 '13 15:11 apennebaker

Is anyone working on this, @jhy? I'm thinking of giving it a try.

jjpatel361 avatar Jun 27 '17 00:06 jjpatel361

@jjpatel361 it's all yours! :)

jhy avatar Jun 27 '17 03:06 jhy

@jhy PR #921 submitted for initial review. Please provide your feedback on this one.

jjpatel361 avatar Jul 04 '17 01:07 jjpatel361

@jhy Any feedback on this PR?

jjpatel361 avatar Oct 01 '17 16:10 jjpatel361

This is a REAL BAD IDEA from a security POV and I suggest you do NOT do this at all.

jmanico avatar Apr 12 '21 18:04 jmanico