XMLResource allow='all' default can lead to SSRF
Today, the XMLResource object source can be a remote URL, local URL, file path, or string literal. When loaded, the XMLResource tries to parse the given source first as a URL, then as a Path (local file), then as a string literal*.
The new-user documentation on "Create a schema instance" does not mention this behavior except for the section "Creating a local copy of a remote XSD schema for offline use." The "Validation" documentation is similar.
A user may, roughly following the documentation, write the following:
import xmlschema
with open('local.xsd', 'r') as infile:
data = infile.read()
xmlschema.XMLSchema(data)
If local.xsd is indeed an XSD file, we have no problem. However, imagine local.xsd comes from an untrusted location. If a malicious actor could set the content of local.xsd to:
http://evil.com/myfavoritepayload
Then calling xmlschema.XMLSchema() will make a HTTP GET to that URL. This is Server-Side Request Forgery (SSRF).
This can be mitigated by using the allow setting:
import xmlschema
with open('local.xsd', 'r') as infile:
data = infile.read()
resource = xmlschema.XMLResource(
source=data,
allow='local',
)
xmlschema.XMLSchema(resource)
However, the default behavior has allow='all'.
Would you consider changing the default to allow='local' and releasing a new major version? I understand this would be a breaking change for end users, but I believe the pain of upgrading to a new major version is worth mitigating the risk of this default.
*Technically there is additional complexity with bytes-like and StringIO-like objects but that is not relevant for this discussion.
Hi,
I'm unsure about that. If one process locally an untrusted file is already an hazard for security. Also the default way for processing a local.xsd is to provide the filepath to XMLSchema:
import xmlschema
schema = xmlschema.XMLSchema('local.xsd')
That raises a XML parse error if local.xsd doesn't contain a valid XML (XSD) source.
Remote access of location hints can be done only by URL, not by file or file-like objects nor StringIO, and for default each remote access is defused before loading XML data. But certainly if a remote schema contains malicious location hint a potential SSRF can be addressed to another site, but only on other XSD resources.
The defaults of security options allow='all' and defuse='remote' was defined considering that the local system is safe. Also when you access remote locations the location normalization should avoid tentatives of local file accesses (but maybe bugs could be exist for particular cases, who knows). In order to increase local security other limits could be introduced on XML resources (e.g. maximum number of elements/attributes/namespaces/cdata, counted when iterparse processes the XML resource).
Changing allow='local' as default for XMLSchema would limit seriously the usability of the package in an open context where you need to load additional namespaces from remote, usually standard, sources. Changing the default of allow only for XMLResource has little sense because the value of this option is provided by parent schema.
So better to think about that and waiting other opinions or information on how other validators do for default.
Thanks for the feedback. I look forward to hearing other opinions as well.
The defaults of security options allow='all' and defuse='remote' was defined considering that the local system is safe.
I agree with this idea. Let me reframe my perspective.
As the developer of a system that performs XML validation, I must take a payload from an untrusted source. I could either send the content to xmlschema as a string, or save it as local.xml. Either way, it gives the attacker a chance to perform SSRF unless I set allow='local' or perform a prerequisite XML check using a different library.
If you do not want to change the allow default, would you be open to adding some context to the documentation mentioning the remote URL feature? I think that would help mitigate my concern in lieu of code changes.
Hi,
i made a lot of tentatives on that. The problem il that the XMLResource is built for being flexible on accepted source type, in order to improve the usability. But I understand that in open contexts the default config could be not the safer choice. Also another problem is that some parameters applicable for processing schema resources are not applicable to XML instances, some others are applicable to both situations.
So my choice has driven to develop a better control on arguments and options applicable in different contexts (XML resources, schemas, XML validated documents, maybe also converters soon), using dataclasses for creating validated settings class/instances. Maybe this will permits to change default configurations with different defaults.
About XMLResource I will apply MAX_XML_DEPTH package limit directly in XML parsing and add a MAX_XML_ELEMENT package limit to parsing of non-lazy resources. Also i think that should be a good idea to add a specific argument to limit the type of sources that can be processed (e.g.: block='none' permits all the types accepted, block='text' blocks only XSD data provided using a string,block='text file io url tree' blocks all the types).
If you do not want to change the
allowdefault, would you be open to adding some context to the documentation mentioning the remote URL feature? I think that would help mitigate my concern in lieu of code changes.
After that surely a contribution on docs will be very useful to better explain new features. But, if you can, also adding information about usage cases and security concerns with remote URLs will be much appreciated.
thanks
Hi, a new minor release is out, that contains improvements in checking arguments and limits.
In particular it could be useful examining the SchemaSettings dataclass, that is now the core for storing the options for building schemas:
https://xmlschema.readthedocs.io/en/latest/features.html#schema-settings
Also the package safe limits have been extended and now are verified if changed.
Some feedback about these new features, and an help on extending docs about security options ad in safe usage of the library would be very appreciated.
thanks