rdflib
rdflib copied to clipboard
URLInputSource can be abused to retrieve arbitrary documents if used naïvely
Discussed in https://github.com/RDFLib/rdflib/discussions/1543
Originally posted by alexdutton July 20, 2021 This is mostly related to rdflib-jsonld, but the dereferencing implementation is in rdflib, hence raising it here.
Scenario
If a web service takes POSTed JSON-LD data, e.g. as part of a Linked Data Notifications implementation, rdflib will attempt to resolve any URL in the @context
. This can lead to:
- attackers being able to probe internal networks, by having rdflib request potential non-public URLs
- reflection attacks, if the same or slightly-different URLs are repeated multiple times in the
@context
- resource exhaustion, as the entire remote file is loaded into memory before JSON parsing is attempted (admittedly an rdflib-jsonld issue)
- denial of service, if web or task workers are tied up waiting for extended periods for HTTP requests to complete
- attackers being able to probe the local filesystem using
file://
URLs
Problem
rdflib provides no way to control how external references are resolved, nor a way to implement caching of external resources.
An implementor should be able to:
- add URLs to a safelist, if e.g. they only expect certain JSON-LD contexts to be used
- provide local copies of remote resources, to obviate needing to make HTTP requests
- hook in a caching mechanism
These things should either be possible directly, or there should be an obvious way to hook them in.
Resolution
A new Resolver
base class should be added that takes responsibility for resolving external references and returning InputSource
instances, probably encapsulating the create_input_source()
behaviour in a resolve()
method. There should be a default implementation that resolves everything called e.g. DefaultResolver
. Maybe this resolver has an instantiation parameter like resolve_schemes=('file', 'http', 'https')
so it's easy to turn off dereferencing.
An optional resolver
argument should be added to Graph.parse()
, so that implementors can override the default behaviour. This is then passed down to the Parser.parse()
plugin implementation, defaulting to an instance of DefaultResolver
if not specified.
Finally, rdflib-jsonld can be updated to use the resolver
instead of create_input_source
directly.
Maybe there should also be a way to install a global default resolver to easily implement these protections without having to track down every Graph.parse()
call.
Happy to put together a PR if/when an approach is agreed.
Converted back to an issue from discussion, this is tracked here: https://security.snyk.io/vuln/SNYK-PYTHON-RDFLIB-1324490
We should resolve this some way or another, either by making a change to rdflib or disputing the vulnerability.
"critical" label transfers from the original issue.
FTR, Alex Dutton had worked up an implementation that seems to follow your intention: https://github.com/alexdutton/rdflib/commits/fix/1369-custom-resolver, dunno if rebasing it to current master will save some time.
Any progress on this?
Getting pushback as to the advisability of using rdflib, based upon the lack of resolution to this issue.
"We should resolve this some way or another, either by making a change to rdflib or disputing the vulnerability." - agree.
Nobody is working on this at the moment. A draft PR was made for this, but it is not in a mergable state and also reduces the overall quality of RDFLib. We are open to PRs to fix this though and I understand it is a priority, but right now the best I can say is that I will look into it when I have time, and will try get it fixed before the next release which will be in october.
Thanks @aucampia for this update.
I would have had a go at a PR myself but I'm not familiar enough with inner workings and idiosyncrasies of rdflib to be able to quickly dive in.
Reading the snyk Issue, the fix to me would look like the ability to initialise rdflib [or at least the JSON-LD bits of it] in a secure mode that would not connect externally to URLs in any form; with ability to set one or more whitelist URLs that it would connect to.
I'm wondering if the original PR, although attempting something similar, was trying to cover too much ground.
For reference:
- The original PR was https://github.com/RDFLib/rdflib/pull/1385 - but it never went past draft and was essentially abandoned.
- This PR was rebased in https://github.com/RDFLib/rdflib/pull/1995 but rejected because of issues listed there
The PR did try to do a lot, but a lot has to be done, through it has to be done in a way that does not deteriorate the codebase or create liabilities, and there are some considerations outside of what the PR addressed that should also be addressed.
What would be very helpful for resolving this issue is some prior art of how this is addressed in other RDF processors, I will try and make a bit of a survey.
But I also think we should have an ADR for this before a published PR.
Hi, this is triggering a number of testing-autoremovals in debian, as there are a number of packages that depend on rdflib and this is a rather serious issue. I know work you on rdflib in 'volunteer' time, but would it be possible to give any sort of ETA for this?
I'm asking for it since I'm a bit concerned if it'd take too much time then it'd end up affecting a bunch of reverse-dependencies in next debian release, to the point that they'd end up being excluded, and hence the ping.
I know work you on rdflib in 'volunteer' time
The concept of "volunteer time" may be a bit ambiguous, but I work on RDFLib exclusively during personal time, not sure about other maintainers.
but would it be possible to give any sort of ETA for this?
I'm asking for it since I'm a bit concerned if it'd take too much time then it'd end up affecting a bunch of reverse-dependencies in next debian release, to the point that they'd end up being excluded, and hence the ping.
I will do my best to fix it before the next release and that is planned for middle of October, but I'm not sure what Debian's release schedule is, so would be good to understand if this timeline is very problematic.
On Fri, Aug 19, 2022 at 07:12:18AM -0700, Iwan Aucamp wrote:
I'm asking for it since I'm a bit concerned if it'd take too much time then it'd end up affecting a bunch of reverse-dependencies in next debian release, to the point that they'd end up being excluded, and hence the ping.
I will do my best to fix it before the next release and that is planned for middle of October,
Thank you :)
but I'm not sure what Debian's release schedule is, so would be good to understand if this timeline is very problematic.
Currently freeze is scheduled from next year Jan. Basically, if it isn't fixed before this years christmas, it'd be too late.
Currently freeze is scheduled from next year Jan. Basically, if it isn't fixed before this years christmas, it'd be too late.
However, rdflib with RC bugs creates a lot of noise in the Debian package pool and removes heaps of packages from Debian testing. So it would be great if this could be fixed rather sooner than later. So if you might have a patch or some pre-release we might be interested. Thanks a lot for working on rdflib, Andreas.
I have started working on this and I do expect I will be finished around the middle of October which is the target for the next release. I have been a bit busy with other things so I have had limited time to dedicate to rdflib maintenance.
quoting https://github.com/RDFLib/rdflib/discussions/2038#discussioncomment-3872799
I'm on leave next week (2022-W41), I'm going to finish whatever is needed for https://github.com/RDFLib/rdflib/issues/1844 and try and fix as many bugs as possible, and then make the release on 2022-10-23 if all goes well.
The release will be 6.3.0 as I won't include any breaking changes.
I'm on leave next week (2022-W41), I'm going to finish whatever is needed for https://github.com/RDFLib/rdflib/issues/1844 and try and fix as many bugs as possible, and then make the release on 2022-10-23 if all goes well.
Hi, Is there still under your radar? I am asking this since we are a bit over 23-10-2022 now.
I'm on leave next week (2022-W41), I'm going to finish whatever is needed for #1844 and try and fix as many bugs as possible, and then make the release on 2022-10-23 if all goes well.
Hi, Is there still under your radar? I am asking this since we are a bit over 23-10-2022 now.
Hi, workload at my day job is quite high, I'm trying but I really have very little spare capacity for now I can't really give you any good answers.
Am Thu, Oct 27, 2022 at 03:37:07PM -0700 schrieb Iwan Aucamp:
Hi, workload at my day job is quite high, I'm trying but I really have very little spare capacity for now I can't really give you any good answers. I know we are all volunteers but I'd like to stress the urgency of the problem. Can you get any help from other team members of rdflib? Kind regards, Andreas.
I have added a bunch of tests now, and have written a URI mapper and filter, but also, while trying to find the optimal place to hook these things in, I have found urllib.request.install_opener(opener)
Install an OpenerDirector instance as the default global opener. Installing an opener is only necessary if you want urlopen to use that opener; otherwise, simply call OpenerDirector.open() instead of urlopen(). The code does not check for a real OpenerDirector, and any class with the appropriate interface will work.
Given this I actually don't think this is a valid vulnerability, and instead is just a feature request, as anyone can easily install an opener that prevents everything reported in the vulnerability:
If a web service takes POSTed JSON-LD data, rdflib will attempt to resolve any URL in the @context. This can lead to:
- attackers being able to probe internal networks, by having rdflib request potential non-public URLs
- reflection attacks, if the same or slightly-different URLs are repeated multiple times in the @context
- resource exhaustion, as the entire remote file is loaded into memory before JSON parsing is attempted
- Denial of Service, if web or task workers are tied up waiting for extended periods for HTTP requests to complete
- attackers being able to probe the local filesystem using file:// URLs
I will still see how to hook in the remapping and filtering in a way that does not interfere with python's builtin mechanisms for customizing URL loading, but given standard documented python functional can be used to avoid all listed problems I will dispute the vulnerability.
I took this up with SYNK:
We will change the wording of the advisory to reflect the findings of the maintainer (the one who opened this ticket) regarding the built-in behavior of Python, and the recommendation of how to remediate the issues described by using the library in a safe way. We’ve decided to keep the advisory in place, since the RDFlib does provide URL parsing functionality and it is therefore valuable to users to be aware of the possible risk of using it naively. We will be happy to include a reference to a warning in the relevant documentation if they’d like to add one, or to the type hints that have been added to the library, which also steer the user in the direction of safe usage.
A bit oddly phrased, I guess they want us to include a warning. I will do that, we may consider functionality for URL re-direction, but with a combination of urllib.request.install_opener and sys.addaudithook users can already prevent opening of malicious URLs. Furthermore, if you are actually running untrusted inputs, things like firejail/docker and normal firewalling should be used instead.
If the available options are not sufficent we would need a specific use case so we can fix the specific problem.
@alexdutton as you originally opened this I would appreciate if you can coordinate communication with SYNK further, it really is an incredibly poor experience as I have to raise a support request against their commercial product and then play broken telephone via their commercial 1st line support.
@alexdutton could urllib.request.urlopen not also be used to "be abused to retrieve arbitrary documents if used naïvely", and should you not raise similar advisories against python? I would say both URLInputSource and urlopen retrieve arbitrary documents by design, so probably the issues should be more specific to JSON-LD context handling.
Further communication from SYNK:
We are still discussing this further with our Security team and shall keep you posted accordingly.
Stay tuned !
Really not a great experience, I get they have no incentive to offer a good experience to library maintainers, so I would appreciate any help you can offer here @alexdutton.
I'm going to merge https://github.com/RDFLib/rdflib/pull/2270 soon, which I will consider as closing this matter. All that does is add documentation.
Okay, https://security.snyk.io/vuln/SNYK-PYTHON-RDFLIB-1324490 was updated
Amendment This was deemed not a vulnerability.
Great news!
Thanks to all that persevered to bring this to a conclusion