xmlschema
Change mapping of nullable types for `schema.to_dict()`
Background
In `schema.to_dict()`, the parsing of nullable types and empty strings does not follow the "natural mapping".
Both XML and Python support empty strings.
However, the XML empty string currently maps to Python `None`, while the XML nil maps to a (rather unpythonic) dictionary construct in Python.
Example
import xmlschema
from pprint import pprint

# Schema with a plain string element and a nillable string element.
xsd = """<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="note">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="emptystring" type="xs:string"/>
        <xs:element name="nillstring" type="xs:string" nillable="true"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

# Instance with one empty element and one nilled element.
xml = """<?xml version="1.0" encoding="UTF-8"?>
<note xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <emptystring/>
  <nillstring xsi:nil="true"/>
</note>
"""

schema = xmlschema.XMLSchema(xsd)
result = schema.to_dict(xml)  # named `result` to avoid shadowing the built-in dict
pprint(result)
Actual output:
{'@xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance',
'emptystring': None,
'nillstring': {'@xsi:nil': 'true'}}
Suggested output:
{'@xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance',
'emptystring': '',
'nillstring': None}
Proposal
I suggest changing the parser mapping as follows:
Case | xml | python (current) | python (suggested) |
---|---|---|---|
Empty | <emptystring/> | emptystring = None | emptystring = "" |
Null | <nillstring xsi:nil="true"/> | 'nillstring': {'@xsi:nil': 'true'} | 'nillstring': None |
If this is found feasible for the first case, other datatypes should be discussed.
Meanwhile, do you have a good, efficient workaround for mapping `<nillstring xsi:nil="true"/>` into Python `None`?
Hi,
Your case is pretty complex, and `xsi:nil` is only partly related to the case of `emptystring`.
In fact the current Python decode of `nillstring` is correct, because `{'@xsi:nil': 'true'}` means that it is an element with an attribute but no content (otherwise it would be `{'@xsi:nil': 'true', '$': ''}`). To get only `'nillstring': None`, a custom converter is needed (overriding the `map_attributes()` method of the default converter).
The difficult part is the decoding/encoding of `<emptystring/>`, for two reasons:
- ElementTree also maps an empty tag's text to `None` (so it seems normal to keep the same decoded value);
- The XPath standard also does not consider `emptystring.text` a text node (it is absent), so inserting an empty text alters this when you want to rebuild the XML with an encode.
So these two things have to be discussed and understood before deciding how to change the decode/encode process.
My proposal is to decode with `''` if this value is compatible with the schema type (no decode error, maybe ...). Encoding back restores the `None` value for `emptystring.text`.
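The ElementTree behavior mentioned in the first reason can be checked with the standard library alone. This is a minimal sketch that parses the instance from the report without any schema; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Parse the reported instance with the standard library only (no schema involved).
root = ET.fromstring(
    '<note xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
    '<emptystring/>'
    '<nillstring xsi:nil="true"/>'
    '</note>'
)
empty = root.find('emptystring')
nil = root.find('nillstring')

print(empty.text)   # None: an empty tag has no text
print(nil.text)     # None as well; the nil marker is only an attribute
print(nil.attrib)   # {'{http://www.w3.org/2001/XMLSchema-instance}nil': 'true'}
```

Both elements look the same through `.text`; only the `xsi:nil` attribute tells them apart.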
Hi @brunato,
Thanks for the response, and thanks for a great library. It is easy to use, and I haven't found any other Python library that can parse XML while respecting XSD datatypes.
I don't agree that "python decode of 'nillstring' is correct" (and I don't follow your argument, perhaps you can elaborate). As a matter of principle, we have the concept of Null (meaning missing information/no value set) across all languages; it may be called null, nil, None, or void, but it still represents the same thing.
I do understand the upsides of keeping the functionality close to similar libraries (ElementTree, XPath), although that doesn't necessarily make it "right". I also understand that my suggestion could break the API, needing lots of discussion from the community and a major release. So I don't expect this to happen overnight. Maybe easier customization of the converter is doable in practice.
Anyway, arguing doesn't help me out. I am provided with XMLs containing `<emptystring/>` and `<nillstring xsi:nil="true"/>` and the corresponding XSDs, without the ability to change them. They need to be converted to the "pythonic" `""` and `None`.
So I will try to customize the converter... Any hints?
> I don't agree that "python decode of 'nillstring' is correct" (and I don't follow your argument, perhaps you can elaborate).
Converters shape the decoded data. By default, complex elements are decoded to dictionaries in which attributes are items with keys prefixed by `@`, the text is mapped to the `'$'` key, and the children are mapped to other keys by name, with repeated tags collapsed into a list.
> As a matter of principle, we have the concept of Null (meaning missing information/no value set) across all languages; it may be called null, nil, None, or void, but it still represents the same thing.
See http://wiki.open311.org/JSON_and_XML_Conversion/
As you can see (the 'status_notes' field), empty tags are mapped in various ways, not necessarily to a null value.
> I do understand the upsides of keeping the functionality close to similar libraries (ElementTree, XPath), although that doesn't necessarily make it "right".
XPath is a standard for XML, so I have to consider its interpretation of an XML document to be the correct one.
> I also understand that my suggestion could break the API, needing lots of discussion from the community and a major release. So I don't expect this to happen overnight. Maybe easier customization of the converter is doable in practice.
The impact is still to be evaluated, and it should be none or very limited if we decide to implement this behaviour only in a custom converter.
> Anyway, arguing doesn't help me out. I am provided with XMLs containing `<emptystring/>` and `<nillstring xsi:nil="true"/>` and the corresponding XSDs, without the ability to change them. They need to be converted to the "pythonic" `""` and `None`. So I will try to customize the converter... Any hints?
Subclass `XMLSchemaConverter` and refactor the `element_decode()` method, filtering out `'{http://www.w3.org/2001/XMLSchema-instance}nil'` from `data.attributes` and changing the parts that use `data.text`.
More details on xsi:nil and XML Schema content types
Hi @th0ger,
`xsi:nil` is one of the four attributes defined within the "http://www.w3.org/2001/XMLSchema-instance" namespace. The meaning of these attributes is described in the XSD formal definition (Structures). In this document the `xsi:nil` description is:
2.7.2 xsi:nil
XML Schema Definition Language: Structures introduces a mechanism for signaling that an element
must be accepted as valid when it has no content despite a content type which does not require or
even necessarily allow empty content. An element can be valid without content if it has the attribute
xsi:nil with the value true. An element so labeled must be empty, but can carry attributes if permitted
by the corresponding complex type.
Moreover, XML Schema has four varieties of content types:
- empty: validates elements with no character or element information item (children)
- simple: validates elements with character-only children using its simple type definition
- element-only: validates elements with children that conform to the content model supplied by its particle
- mixed: validates elements with children that conform to the content model supplied by its particle, accepting character data children (not only spaces)
If `xsi:nil="true"`, the element is valid if and only if its content is empty, regardless of its effective XSD type.
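As a rough illustration of that rule, the following sketch checks the two conditions on a parsed element with ElementTree. The helpers `is_nil` and `has_empty_content` are hypothetical names, not part of xmlschema, and the check covers only this one rule rather than full schema validation.

```python
import xml.etree.ElementTree as ET

XSI_NIL = '{http://www.w3.org/2001/XMLSchema-instance}nil'
XSI_DECL = 'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'

def is_nil(elem):
    """True when the element carries xsi:nil with a true xs:boolean value."""
    return elem.get(XSI_NIL) in ('true', '1')

def has_empty_content(elem):
    """True when the element has no character data and no child elements."""
    return not elem.text and len(elem) == 0

ok = ET.fromstring(f'<nillstring {XSI_DECL} xsi:nil="true"/>')
bad = ET.fromstring(f'<nillstring {XSI_DECL} xsi:nil="true">text</nillstring>')

print(is_nil(ok), has_empty_content(ok))    # True True
print(is_nil(bad), has_empty_content(bad))  # True False (a nilled element must be empty)
```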
Default element values also apply when elements are empty. So empty is not simply a value for an element (`''`), but a condition that is related to other schema options (and also to the instance when you use XSI attributes on it).
The choice of a `None` or empty string value for an empty tag is IMHO related to validation against its type. Clearly if the type is `xs:int` the `''` is not applicable, and I think a `None` value would be better in this case.
It is questionable if the type is `xs:string` but the `xs:minLength` is '1'. In this case a programmatic check could be based on an isinstance check.
In any case, the final decision on whether to map `None` or empty string values could be delegated to a custom converter after the decode phase (this is the role of converters in xmlschema). For `None` values the `filler` option can also be used instead of a custom converter.
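For comparison, the same decision can also be taken outside the library, as a plain post-processing pass over the decoded data. This sketch is neither a converter nor the `filler` option: `replace_none` is a hypothetical helper, and it is blunt because it cannot tell which `None` came from an empty string element and which from another type.

```python
def replace_none(value, replacement=''):
    """Recursively swap None for `replacement` in decoded dicts and lists."""
    if value is None:
        return replacement
    if isinstance(value, dict):
        return {key: replace_none(item, replacement) for key, item in value.items()}
    if isinstance(value, list):
        return [replace_none(item, replacement) for item in value]
    return value

decoded = {'emptystring': None, 'items': [None, 'a'], 'nested': {'x': None, 'n': 1}}
print(replace_none(decoded))
# {'emptystring': '', 'items': ['', 'a'], 'nested': {'x': '', 'n': 1}}
```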
For empty tags ElementTree also admits an empty string for `elem.text` instead of `None` (depending on the parser).
In any case an empty string is not considered a text node (§6.7.1 of XQuery and XPath Data Model 3.1).
I'm still inclined to use the empty string value if there are no validation errors (lax decode mode), as already happens when there are validation errors on simple values.
I would add two options for decoding (to_dict/decode/iter_decode):
- `keep_empty`: if set to `True`, empty elements that are valid are decoded with an empty string value instead of `None`.
- `element_hook`: an optional function that is called with the decoded element data before calling the converter decode method. It takes an `ElementData` instance plus optionally the XSD element and the XSD type, and returns a new `ElementData` instance.
The first option changes the default behavior for empty elements, which are then decoded to an empty string instead of `None`.
The second is a general option that allows replacing the decoded element data with different data (e.g. filtering out a specific attribute).
Anyway, one can create a custom converter to do that, but this is not the usual role of converters, which are dedicated to shaping data, not to decoding/encoding (I will also clean them of the checks on `data.text != ''`).
Nillable elements, as stated by the XSD standard, are decoded with nil/null content but retain their attributes, so cases like yours are generally not decoded to `None` (e.g. when the element has other attributes besides `xsi:nil='true'`).
Looking at conventions, nothing indicates that the empty string rather than `None` is the usual default for empty elements, so changing this default behavior is questionable. Passing `keep_empty=True` changes the behavior as you proposed and doesn't break backward compatibility.
> - `keep_empty`: if set to `True`, empty elements that are valid are decoded with an empty string value instead of `None`.
So that would apply only to string elements (as declared in the XSD), right?
> > - `keep_empty`: if set to `True`, empty elements that are valid are decoded with an empty string value instead of `None`.
> So that would apply only to string elements (as declared in the XSD), right?
Yes, in practice with all XSD simple types that include the empty string in their value space, so also simple strings (but not derived strings that require a minimum string length > 0).
(That behavior applies with `'lax'` validation or `'strict'` validation without errors; if you use `'skip'` validation the decoding errors are ignored and the empty string values are kept.)
> > > - `keep_empty`: if set to `True`, empty elements that are valid are decoded with an empty string value instead of `None`.
> > So that would apply only to string elements (as declared in the XSD), right?
> Yes, in practice with all XSD simple types that include the empty string in their value space, so also simple strings (but not derived strings that require a minimum string length > 0).
> (That behavior applies with `'lax'` validation or `'strict'` validation without errors; if you use `'skip'` validation the decoding errors are ignored and the empty string values are kept.)
Correcting this after a test: `None` is used only if there is a decode error (e.g. decoding `''` to int); otherwise the empty string is kept even if there is a validation error (e.g. a facet constraint is not satisfied).
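The corrected rule can be pictured with a toy model in plain Python. This is only an analogy, not the library's implementation: `decode_empty` is a hypothetical helper, and Python's `str`, `int` and `float` constructors stand in for XSD type decoders.

```python
def decode_empty(python_type):
    """Hypothetical helper: value chosen for an empty element under the rule above."""
    try:
        python_type('')   # e.g. int('') raises ValueError
    except ValueError:
        return None       # decode error: fall back to None
    return ''             # no decode error: the empty string is kept

print(repr(decode_empty(str)))    # ''
print(repr(decode_empty(int)))    # None
print(repr(decode_empty(float)))  # None
```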
Hi @th0ger,
Release v2.0.3 has the new options described above. For your reported case, a possible usage is:
# Reuses xmlschema, pprint, xsd and xml from the first example.
schema = xmlschema.XMLSchema(xsd)

def filter_nil(element_data, *args):
    # Elements without attributes pass through unchanged.
    if not element_data.attributes:
        return element_data
    # Rebuild the element data without the xsi:nil attribute.
    return xmlschema.ElementData(
        tag=element_data.tag,
        text=element_data.text,
        content=element_data.content,
        attributes=[x for x in element_data.attributes
                    if x[0] != '{http://www.w3.org/2001/XMLSchema-instance}nil']
    )

result = schema.to_dict(xml, keep_empty=True, element_hook=filter_nil)
pprint(result)
that has the output:
{'@xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance',
'emptystring': '',
'nillstring': None}
@brunato thanks a bunch! I have tested it, and it works like a charm. The performance is also unaffected.
> The performance is also unaffected.
This was the occasion to realize that the emptiness checks on `data.text` in converters were unnecessary. Removing them balances the additional check that tests whether `elem.text` is empty/`None`.
Adding the `element_hook` check has a small impact on decoding performance, but gives a definitive option for doing complex customization on decoded data (if one knows what one is doing ... ;) ).
In any case, another option is to create a custom converter, but this shouldn't be the role of converters, which are mainly for data shaping, not for decoding or filtering the decoded data.
If this solution satisfies you, please close the issue.
Changing the default decode behavior may not have very wide consensus, and can be a harsh decision to take even in a new major release.
Consider that for empty elements a definitive choice between `''` and `None` is uncertain, also for compatibility with decode conventions that are not related to an XSD schema.
Furthermore, nillable elements also retain attributes, so removing `xsi:nil` doesn't ensure that the element is decoded to a simple `None`.
Thank you