Discussed in https://github.com/RDFLib/rdflib/discussions/1543

^{Originally posted by alexdutton July 20, 2021} This is mostly related to rdflib-jsonld, but the dereferencing implementation is in rdflib, hence raising it here.

Scenario

If a web service takes POSTed JSON-LD data, e.g. as part of a Linked Data Notifications implementation, rdflib will attempt to resolve any URL in the @context. This can lead to:

attackers being able to probe internal networks, by having rdflib request potential non-public URLs
reflection attacks, if the same or slightly-different URLs are repeated multiple times in the @context
resource exhaustion, as the entire remote file is loaded into memory before JSON parsing is attempted (admittedly an rdflib-jsonld issue)
denial of service, if web or task workers are tied up waiting for extended periods for HTTP requests to complete
attackers being able to probe the local filesystem using file:// URLs

Problem

rdflib provides no way to control how external references are resolved, nor a way to implement caching of external resources.

An implementor should be able to:

add URLs to a safelist, if e.g. they only expect certain JSON-LD contexts to be used
provide local copies of remote resources, to obviate needing to make HTTP requests
hook in a caching mechanism

These things should either be possible directly, or there should be an obvious way to hook them in.

Resolution

A new Resolver base class should be added that takes responsibility for resolving external references and returning InputSource instances, probably encapsulating the create_input_source() behaviour in a resolve() method. There should be a default implementation that resolves everything called e.g. DefaultResolver. Maybe this resolver has an instantiation parameter like resolve_schemes=('file', 'http', 'https') so it's easy to turn off dereferencing.

An optional resolver argument should be added to Graph.parse(), so that implementors can override the default behaviour. This is then passed down to the Parser.parse() plugin implementation, defaulting to an instance of DefaultResolver if not specified.

Finally, rdflib-jsonld can be updated to use the resolver instead of create_input_source directly.

Maybe there should also be a way to install a global default resolver to easily implement these protections without having to track down every Graph.parse() call.

Happy to put together a PR if/when an approach is agreed.

Apr 17 '22 20:04 aucampia

Converted back to an issue from discussion, this is tracked here: https://security.snyk.io/vuln/SNYK-PYTHON-RDFLIB-1324490

We should resolve this some way or another, either by making a change to rdflib or disputing the vulnerability.

Apr 17 '22 20:04 aucampia

"critical" label transfers from the original issue.

Apr 17 '22 21:04 aucampia

FTR, Alex Dutton had worked up an implementation that seems to follow your intention: https://github.com/alexdutton/rdflib/commits/fix/1369-custom-resolver, dunno if rebasing it to current master will save some time.

Apr 18 '22 10:04 ghost

Any progress on this?

Getting pushback as to the advisability of using rdflib, based upon the lack of resolution to this issue.

"We should resolve this some way or another, either by making a change to rdflib or disputing the vulnerability." - agree.

Jul 25 '22 15:07 RichardWallis

Nobody is working on this at the moment. A draft PR was made for this, but it is not in a mergable state and also reduces the overall quality of RDFLib. We are open to PRs to fix this though and I understand it is a priority, but right now the best I can say is that I will look into it when I have time, and will try get it fixed before the next release which will be in october.

Jul 26 '22 10:07 aucampia

Thanks @aucampia for this update.

I would have had a go at a PR myself but I'm not familiar enough with inner workings and idiosyncrasies of rdflib to be able to quickly dive in.

Reading the snyk Issue, the fix to me would look like the ability to initialise rdflib [or at least the JSON-LD bits of it] in a secure mode that would not connect externally to URLs in any form; with ability to set one or more whitelist URLs that it would connect to.

I'm wondering if the original PR, although attempting something similar, was trying to cover too much ground.

Jul 28 '22 08:07 RichardWallis

For reference:

The original PR was https://github.com/RDFLib/rdflib/pull/1385 - but it never went past draft and was essentially abandoned.
This PR was rebased in https://github.com/RDFLib/rdflib/pull/1995 but rejected because of issues listed there

The PR did try to do a lot, but a lot has to be done, through it has to be done in a way that does not deteriorate the codebase or create liabilities, and there are some considerations outside of what the PR addressed that should also be addressed.

Jul 29 '22 02:07 aucampia

What would be very helpful for resolving this issue is some prior art of how this is addressed in other RDF processors, I will try and make a bit of a survey.

But I also think we should have an ADR for this before a published PR.

Jul 29 '22 02:07 aucampia

Hi, this is triggering a number of testing-autoremovals in debian, as there are a number of packages that depend on rdflib and this is a rather serious issue. I know work you on rdflib in 'volunteer' time, but would it be possible to give any sort of ETA for this?

I'm asking for it since I'm a bit concerned if it'd take too much time then it'd end up affecting a bunch of reverse-dependencies in next debian release, to the point that they'd end up being excluded, and hence the ping.

Aug 19 '22 12:08 nileshpatra

I know work you on rdflib in 'volunteer' time

The concept of "volunteer time" may be a bit ambiguous, but I work on RDFLib exclusively during personal time, not sure about other maintainers.

but would it be possible to give any sort of ETA for this?

I'm asking for it since I'm a bit concerned if it'd take too much time then it'd end up affecting a bunch of reverse-dependencies in next debian release, to the point that they'd end up being excluded, and hence the ping.

I will do my best to fix it before the next release and that is planned for middle of October, but I'm not sure what Debian's release schedule is, so would be good to understand if this timeline is very problematic.

Aug 19 '22 14:08 aucampia

On Fri, Aug 19, 2022 at 07:12:18AM -0700, Iwan Aucamp wrote:

I'm asking for it since I'm a bit concerned if it'd take too much time then it'd end up affecting a bunch of reverse-dependencies in next debian release, to the point that they'd end up being excluded, and hence the ping.

I will do my best to fix it before the next release and that is planned for middle of October,

Thank you :)

but I'm not sure what Debian's release schedule is, so would be good to understand if this timeline is very problematic.

Currently freeze is scheduled from next year Jan. Basically, if it isn't fixed before this years christmas, it'd be too late.

Aug 19 '22 14:08 nileshpatra

Currently freeze is scheduled from next year Jan. Basically, if it isn't fixed before this years christmas, it'd be too late.

However, rdflib with RC bugs creates a lot of noise in the Debian package pool and removes heaps of packages from Debian testing. So it would be great if this could be fixed rather sooner than later. So if you might have a patch or some pre-release we might be interested. Thanks a lot for working on rdflib, Andreas.

Sep 29 '22 13:09 tillea

I have started working on this and I do expect I will be finished around the middle of October which is the target for the next release. I have been a bit busy with other things so I have had limited time to dedicate to rdflib maintenance.

Sep 29 '22 18:09 aucampia

quoting https://github.com/RDFLib/rdflib/discussions/2038#discussioncomment-3872799

I'm on leave next week (2022-W41), I'm going to finish whatever is needed for https://github.com/RDFLib/rdflib/issues/1844 and try and fix as many bugs as possible, and then make the release on 2022-10-23 if all goes well.

The release will be 6.3.0 as I won't include any breaking changes.

Oct 13 '22 18:10 aucampia

I'm on leave next week (2022-W41), I'm going to finish whatever is needed for https://github.com/RDFLib/rdflib/issues/1844 and try and fix as many bugs as possible, and then make the release on 2022-10-23 if all goes well.

Hi, Is there still under your radar? I am asking this since we are a bit over 23-10-2022 now.

Oct 27 '22 19:10 nileshpatra

I'm on leave next week (2022-W41), I'm going to finish whatever is needed for #1844 and try and fix as many bugs as possible, and then make the release on 2022-10-23 if all goes well.

Hi, Is there still under your radar? I am asking this since we are a bit over 23-10-2022 now.

Hi, workload at my day job is quite high, I'm trying but I really have very little spare capacity for now I can't really give you any good answers.

Oct 27 '22 22:10 aucampia

Am Thu, Oct 27, 2022 at 03:37:07PM -0700 schrieb Iwan Aucamp:

Hi, workload at my day job is quite high, I'm trying but I really have very little spare capacity for now I can't really give you any good answers. I know we are all volunteers but I'd like to stress the urgency of the problem. Can you get any help from other team members of rdflib? Kind regards, Andreas.

Oct 28 '22 04:10 tillea

I have added a bunch of tests now, and have written a URI mapper and filter, but also, while trying to find the optimal place to hook these things in, I have found urllib.request.install_opener(opener)

Install an OpenerDirector instance as the default global opener. Installing an opener is only necessary if you want urlopen to use that opener; otherwise, simply call OpenerDirector.open() instead of urlopen(). The code does not check for a real OpenerDirector, and any class with the appropriate interface will work.

Given this I actually don't think this is a valid vulnerability, and instead is just a feature request, as anyone can easily install an opener that prevents everything reported in the vulnerability:

If a web service takes POSTed JSON-LD data, rdflib will attempt to resolve any URL in the @context. This can lead to:

attackers being able to probe internal networks, by having rdflib request potential non-public URLs

reflection attacks, if the same or slightly-different URLs are repeated multiple times in the @context

resource exhaustion, as the entire remote file is loaded into memory before JSON parsing is attempted

Denial of Service, if web or task workers are tied up waiting for extended periods for HTTP requests to complete

attackers being able to probe the local filesystem using file:// URLs

I will still see how to hook in the remapping and filtering in a way that does not interfere with python's builtin mechanisms for customizing URL loading, but given standard documented python functional can be used to avoid all listed problems I will dispute the vulnerability.

Feb 18 '23 16:02 aucampia

I took this up with SYNK:

We will change the wording of the advisory to reflect the findings of the maintainer (the one who opened this ticket) regarding the built-in behavior of Python, and the recommendation of how to remediate the issues described by using the library in a safe way. We’ve decided to keep the advisory in place, since the RDFlib does provide URL parsing functionality and it is therefore valuable to users to be aware of the possible risk of using it naively. We will be happy to include a reference to a warning in the relevant documentation if they’d like to add one, or to the type hints that have been added to the library, which also steer the user in the direction of safe usage.

A bit oddly phrased, I guess they want us to include a warning. I will do that, we may consider functionality for URL re-direction, but with a combination of urllib.request.install_opener and sys.addaudithook users can already prevent opening of malicious URLs. Furthermore, if you are actually running untrusted inputs, things like firejail/docker and normal firewalling should be used instead.

If the available options are not sufficent we would need a specific use case so we can fix the specific problem.

@alexdutton as you originally opened this I would appreciate if you can coordinate communication with SYNK further, it really is an incredibly poor experience as I have to raise a support request against their commercial product and then play broken telephone via their commercial 1st line support.

Mar 05 '23 11:03 aucampia

@alexdutton could urllib.request.urlopen not also be used to "be abused to retrieve arbitrary documents if used naïvely", and should you not raise similar advisories against python? I would say both URLInputSource and urlopen retrieve arbitrary documents by design, so probably the issues should be more specific to JSON-LD context handling.

Mar 05 '23 11:03 aucampia

Further communication from SYNK:

We are still discussing this further with our Security team and shall keep you posted accordingly.

Stay tuned !

Really not a great experience, I get they have no incentive to offer a good experience to library maintainers, so I would appreciate any help you can offer here @alexdutton.

I'm going to merge https://github.com/RDFLib/rdflib/pull/2270 soon, which I will consider as closing this matter. All that does is add documentation.

Mar 12 '23 22:03 aucampia

Okay, https://security.snyk.io/vuln/SNYK-PYTHON-RDFLIB-1324490 was updated

Amendment This was deemed not a vulnerability.

Mar 21 '23 21:03 aucampia

Great news!

Thanks to all that persevered to bring this to a conclusion

Mar 22 '23 10:03 RichardWallis

rdflib
rdflib copied to clipboard

URLInputSource can be abused to retrieve arbitrary documents if used naïvely

Discussed in https://github.com/RDFLib/rdflib/discussions/1543

Scenario

Problem

Resolution

rdflib rdflib copied to clipboard

URLInputSource can be abused to retrieve arbitrary documents if used naïvely

Discussed in https://github.com/RDFLib/rdflib/discussions/1543

Scenario

Problem

Resolution

rdflib
rdflib copied to clipboard