Change mapping of nullable types for `schema.to_dict()`

Open th0ger opened this issue 2 years ago • 1 comments

Background

In schema.to_dict(), the parsing of nullable types and empty strings does not follow the "natural mapping".

  • In XML, nulls are implemented as xsi:nil
  • In python, nulls are implemented as None

Also, both XML and python support empty strings.

However, the XML empty string currently maps to Python None, while the XML nil maps to a (rather non-Pythonic) dictionary construct in Python.

Example

import xmlschema
from pprint import pprint

xsd = """<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:element name="note">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="emptystring" type="xs:string"/>
      <xs:element name="nillstring" type="xs:string" nillable="true"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
</xs:schema>
"""

xml = """<?xml version="1.0" encoding="UTF-8"?>
<note xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <emptystring/>
  <nillstring xsi:nil="true"/>
</note>
"""

schema = xmlschema.XMLSchema(xsd)
# Use a name that does not shadow the built-in dict type
data = schema.to_dict(xml)
pprint(data)

Actual output:

{'@xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance',
 'emptystring': None,
 'nillstring': {'@xsi:nil': 'true'}}

Suggested output:

{'@xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance',
 'emptystring': '',
 'nillstring': None}

Proposal

I would suggest to change the parser mapping as follows:

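| Case  | XML                          | Python (current)                   | Python (suggested)  |
|-------|------------------------------|------------------------------------|---------------------|
| Empty | <emptystring/>               | 'emptystring': None                | 'emptystring': ''   |
| Null  | <nillstring xsi:nil="true"/> | 'nillstring': {'@xsi:nil': 'true'} | 'nillstring': None  |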

If this turns out to be feasible, other datatypes should be discussed as well.

th0ger avatar Aug 12 '22 14:08 th0ger

Meanwhile, do you have a good, efficient workaround for mapping <nillstring xsi:nil="true"/> into Python None?

th0ger avatar Aug 12 '22 14:08 th0ger

Hi,

your case is pretty complex, and 'xsi:nil' is only partly related to the case of emptystring.

In fact, the current Python decoding of 'nillstring' is correct, because {'@xsi:nil': 'true'} means that it is an element with an attribute but no content (otherwise it would be {'@xsi:nil': 'true', '$': ''}). To get just 'nillstring': None, a custom converter is needed (overriding the map_attributes() method of the default converter).

The difficult part is about decoding/encoding of <emptystring/>, for two reasons:

  1. ElementTree also maps an empty tag's text to None (so it seems natural to keep the same decoded value);
  2. The XPath standard also treats emptystring.text as absent rather than as a text node, so inserting an empty text changes this when you rebuild the XML with an encode.

So these two things have to be discussed and understood before deciding how to change the decode/encode process.
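The ElementTree behaviour from point 1 can be checked with the standard library alone. This is a minimal sketch that assumes the default ElementTree parser; it does not involve xmlschema, and other parsers may report an empty string instead.

```python
import xml.etree.ElementTree as ET

# Clark-notation name under which ElementTree exposes the xsi:nil attribute
XSI_NIL = "{http://www.w3.org/2001/XMLSchema-instance}nil"

root = ET.fromstring(
    '<note xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">'
    "<emptystring/>"
    '<nillstring xsi:nil="true"/>'
    "</note>"
)

empty = root.find("emptystring")
nilled = root.find("nillstring")

empty_text = empty.text        # None: the empty tag has no text
nil_text = nilled.text         # None as well; only the attribute differs
nil_flag = nilled.get(XSI_NIL) # "true"

print(repr(empty_text), repr(nil_text), repr(nil_flag))
```

At the ElementTree level the two elements differ only in their attributes, which is why the distinction has to be made by the schema-aware decoding step.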

My proposal is to decode to '' if this value is compatible with the schema type (no decode error, maybe ...). Encoding back would restore the None value for emptystring.text.

brunato avatar Aug 16 '22 10:08 brunato

Hi @brunato,

Thanks for the response, and thanks for a great library. It is easy to use, and I haven't found any other Python library that can parse XML while respecting XSD datatypes.

I don't agree that "python decode of 'nillstring' is correct" (and I don't follow your argument, perhaps you can elaborate). From a principled point of view, we have the concept of null (meaning missing information / no value set) across all languages. It may be called null, nil, None or void, but it still represents the same thing.

I do understand the upsides of keeping the functionality close to similar libraries (ElementTree, XPath), although that doesn't necessarily make it "right". I also understand that my suggestion could break the API, needing lots of discussion from the community and a major release. So I don't expect this to happen overnight. Maybe easier customization of the converter is doable in practice.

Anyway, arguing doesn't help me out. I am provided with XMLs containing <emptystring/> and <nillstring xsi:nil="true"/> and corresponding XSDs, without the ability to change that. They need to be converted to "pythonic" "" and None. So I will try to customize the converter... Any hints?

th0ger avatar Aug 16 '22 11:08 th0ger

I don't agree that "python decode of 'nillstring' is correct" (and I don't follow your argument, perhaps you can elaborate).

Converters shape the decoded data. By default, complex elements are decoded to dictionaries in which attributes become items with keys prefixed by @, the text is mapped to the '$' key, and children are mapped to other keys by name, with repeated tags collapsed into a list.
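To make that convention concrete, here is a rough stdlib-only approximation of it. The function element_to_dict is hypothetical and is not xmlschema's implementation; it ignores the schema entirely and treats all values as strings.

```python
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    """Illustrative only: mimic the '@attr' / '$' / child-name convention."""
    # Attributes become items whose keys are prefixed by '@'
    result = {f"@{name}": value for name, value in elem.attrib.items()}

    for child in elem:
        value = element_to_dict(child)
        if child.tag in result:
            # Repeated tags collapse into a list
            existing = result[child.tag]
            if isinstance(existing, list):
                existing.append(value)
            else:
                result[child.tag] = [existing, value]
        else:
            result[child.tag] = value

    text = (elem.text or "").strip()
    if not result:
        # No attributes and no children: return the bare value, None if empty
        return text or None
    if text:
        # Text next to attributes or children goes under the '$' key
        result["$"] = text
    return result

doc = ET.fromstring(
    '<note id="1"><to>Ann</to><to>Bob</to>'
    '<body lang="en">Hi</body><empty/></note>'
)
decoded = element_to_dict(doc)
print(decoded)
```

Under this convention an element that has attributes can never collapse to a bare value, which is the reason a nilled element decodes to a dictionary rather than to None.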

From a principled point of view, we have the concept of null (meaning missing information / no value set) across all languages. It may be called null, nil, None or void, but it still represents the same thing.

See this http://wiki.open311.org/JSON_and_XML_Conversion/

As you can see (the 'status_notes' field), empty tags are mapped in various ways, not necessarily to a null value.

I do understand the upsides of keeping the functionality close to similar libraries (ElementTree, XPath), although that doesn't necessarily make it "right".

XPath is a standard for XML, so I have to assume that its interpretation of an XML document is the correct one.

I also understand that my suggestion could break the API, needing lots of discussion from the community and a major release. So I don't expect this to happen overnight. Maybe easier customization of the converter is doable in practice.

The impact is still to be evaluated and it should be none or very limited if we decide to implement this behaviour only in a custom converter.

Anyway, arguing doesn't help me out. I am provided with XMLs containing <emptystring/> and <nillstring xsi:nil="true"/> and corresponding XSDs, without the ability to change that. They need to be converted to "pythonic" "" and None. So I will try to customize the converter... Any hints?

Subclass XMLSchemaConverter and override the element_decode() method, filtering out '{http://www.w3.org/2001/XMLSchema-instance}nil' from data.attributes and changing the parts that use data.text.

brunato avatar Aug 16 '22 12:08 brunato

More details on xsi:nil and XML Schema content types

Hi @th0ger,

xsi:nil is one of the four attributes defined in the "http://www.w3.org/2001/XMLSchema-instance" namespace. The meaning of these attributes is described in the XSD formal definition (Structures), where xsi:nil is described as follows:

2.7.2 xsi:nil

XML Schema Definition Language: Structures introduces a mechanism for signaling that an element 
must be accepted as valid when it has no content despite a content type which does not require or 
even necessarily allow empty content. An element can be valid without content if it has the attribute 
xsi:nil with the value true. An element so labeled must be empty, but can carry attributes if permitted 
by the corresponding complex type.

Moreover, XML Schema has four varieties of content types:

  • empty: validates elements with no character or element information item (children)
  • simple: validates elements with character-only children using its simple type definition
  • element-only: validates elements with children that conform to the content model supplied by its particle
  • mixed: validates elements with children that conform to the content model supplied by its particle, accepting character data children (not only spaces)

If xsi:nil="true", the element is valid if and only if its content is empty, regardless of its effective XSD type.
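The rule above can be sketched with the standard library. The helpers is_nilled and nil_constraint_ok are hypothetical names, and the check covers only the "a nilled element must be empty" part of the rule, not full schema validation (for example, it does not check that the element is declared nillable).

```python
import xml.etree.ElementTree as ET

XSI_NIL = "{http://www.w3.org/2001/XMLSchema-instance}nil"
NS = 'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"'

def is_nilled(elem):
    # xs:boolean accepts "true" and "1" as lexical forms of true
    return elem.get(XSI_NIL, "false").strip() in ("true", "1")

def nil_constraint_ok(elem):
    """A nilled element must have no text and no children; attributes are fine."""
    if not is_nilled(elem):
        return True
    return len(elem) == 0 and not elem.text

ok_empty = nil_constraint_ok(ET.fromstring(f'<a {NS} xsi:nil="true"/>'))
ok_attrs = nil_constraint_ok(ET.fromstring(f'<a {NS} xsi:nil="true" id="3"/>'))
bad_text = nil_constraint_ok(ET.fromstring(f'<a {NS} xsi:nil="true">text</a>'))
not_nil = nil_constraint_ok(ET.fromstring("<a>text</a>"))

print(ok_empty, ok_attrs, bad_text, not_nil)
```

The second case shows why a nilled element is not always reducible to a single null value: it may still carry other attributes.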

Default element values also apply when elements are empty. So empty is not simply a value for an element (''), but a condition that is related to other schema options (and also to the instance, when you use XSI attributes on it).

Whether an empty tag should yield None or an empty string is IMHO related to validation against its type. Clearly, if the type is xs:int, '' is not applicable, and I think a None value is better in this case.

The case is questionable if the type is xs:string but xs:minLength is 1. In this case a programmatic check could be based on an isinstance check.

In any case, the final decision of whether to map to None or to an empty string could be delegated to a custom converter after the decode phase (this is the role of converters in xmlschema). For None values, the filler option can also be used instead of a custom converter.
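Another way to defer the decision is to post-process the decoded dictionary. The sketch below is not part of xmlschema: normalize is a hypothetical helper, and it assumes that every empty leaf is a string element and that the instance document uses the xsi prefix, so the decoded key is '@xsi:nil'.

```python
def normalize(value):
    """Map None to '' and a lone {'@xsi:nil': 'true'} dictionary to None."""
    if value is None:
        # Assumption: the empty element is string-typed
        return ""
    if isinstance(value, dict):
        if len(value) == 1 and value.get("@xsi:nil") in ("true", "1"):
            # Nilled element with no other attributes
            return None
        return {key: normalize(item) for key, item in value.items()}
    if isinstance(value, list):
        return [normalize(item) for item in value]
    return value

# The "Actual output" reported at the top of this issue
decoded = {
    "@xmlns:xsi": "http://www.w3.org/2001/XMLSchema-instance",
    "emptystring": None,
    "nillstring": {"@xsi:nil": "true"},
}
normalized = normalize(decoded)
print(normalized)
```

Because this works without schema information, it would also turn an empty xs:int element into '', so it only suits documents where that assumption holds.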

brunato avatar Aug 17 '22 09:08 brunato

For empty tags, ElementTree also allows an empty string for elem.text instead of None (depending on the parser).

In any case an empty string is not considered a text node (§6.7.1 of XQuery and XPath Data Model 3.1).

I'm still inclined to use the empty string value if there are no validation errors (lax decode mode), as already happens when there are validation errors on simple values.

brunato avatar Aug 18 '22 09:08 brunato

I would add two options for decoding (to_dict/decode/iter_decode):

  • keep_empty: if set to True empty elements that are valid are decoded with an empty string value instead of a None
  • element_hook: an optional function that is called with decoded element data before calling the converter decode method. Takes an ElementData instance plus optionally the XSD element and the XSD type, and returns a new ElementData instance.

The first option changes the default behavior for empty elements, which are then decoded to an empty string instead of None. The second is a general option that allows the decoded element data to be replaced with different data (e.g. to filter out a specific attribute).

One can also create a custom converter to do that, but this is not the usual role of converters, which are dedicated to shaping data, not to decoding/encoding (I will also clean them of the checks on data.text != '').

Nillable elements, as stated by the XSD standard, are decoded with nil/null content but retain their attributes, so cases like yours are generally not decoded to None (e.g. when the element has other attributes besides xsi:nil='true').

Looking at existing conventions, nothing indicates that the empty string, rather than None, is the usual default for empty elements, so changing this default behavior is questionable. Passing keep_empty=True changes the behavior as you proposed and doesn't break backward compatibility.

brunato avatar Aug 22 '22 07:08 brunato

  • keep_empty: if set to True empty elements that are valid are decoded with an empty string value instead of a None

So that would apply only to string elements (as declared in the XSD). right?

th0ger avatar Aug 22 '22 09:08 th0ger

  • keep_empty: if set to True empty elements that are valid are decoded with an empty string value instead of a None

So that would apply only to string elements (as declared in the XSD). right?

Yes, in practice with all XSD simple types that include the empty string in their value-space, so also simple strings (but not with derived strings that require a min string length > 0).

(that behavior is with 'lax' validation or 'strict' validation without errors, if you use 'skip' validation the decoding errors are ignored and the empty string values are kept)

brunato avatar Aug 22 '22 09:08 brunato

  • keep_empty: if set to True empty elements that are valid are decoded with an empty string value instead of a None

So that would apply only to string elements (as declared in the XSD). right?

Yes, in practice with all XSD simple types that include the empty string in their value-space, so also simple strings (but not with derived strings that require a min string length > 0).

(that behavior is with 'lax' validation or 'strict' validation without errors, if you use 'skip' validation the decoding errors are ignored and the empty string values are kept)

Correcting this after a test: None is used only if there is a decode error (e.g. decode '' to int), otherwise the empty string is kept also if there is a validation error (e.g. a facet constraint is not valid).
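The corrected rule can be summarised in a few lines. This is only a sketch of the stated behaviour, not xmlschema's code: decode_empty is a hypothetical helper, and plain Python callables stand in for the XSD type decoders.

```python
def decode_empty(caster):
    """Return the decoded empty string, or None when '' cannot be decoded."""
    try:
        return caster("")
    except ValueError:
        # Decode error, e.g. '' is not in the lexical space of an integer
        return None

as_string = decode_empty(str)   # the empty string is kept
as_int = decode_empty(int)      # decode error, so None is used
as_float = decode_empty(float)  # decode error, so None is used

print(repr(as_string), repr(as_int), repr(as_float))
```

A facet violation such as a minimum length of 1 is a validation error rather than a decode error, so under this rule the empty string would still be kept.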

brunato avatar Aug 22 '22 17:08 brunato

Hi @th0ger,

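Release v2.0.3 has the new options described above. For your reported case, a possible usage is:

import xmlschema
from pprint import pprint

# xsd and xml are the strings defined in the example at the top of the issue
schema = xmlschema.XMLSchema(xsd)

def filter_nil(element_data, *args):
    # The optional XSD element and XSD type arrive in *args and are unused here
    if not element_data.attributes:
        return element_data

    # Rebuild the element data without the xsi:nil attribute
    return xmlschema.ElementData(
        tag=element_data.tag,
        text=element_data.text,
        content=element_data.content,
        attributes=[x for x in element_data.attributes
                    if x[0] != '{http://www.w3.org/2001/XMLSchema-instance}nil']
    )

# Use a name that does not shadow the built-in dict type
data = schema.to_dict(xml, keep_empty=True, element_hook=filter_nil)
pprint(data)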

which produces the output:

{'@xmlns:xsi': 'http://www.w3.org/2001/XMLSchema-instance',
 'emptystring': '',
 'nillstring': None}
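To see what the hook does in isolation, the filtering can be exercised without a schema. In this sketch the ElementData namedtuple is a local stand-in with the same four fields used above, not the class exported by xmlschema, and _replace is used as a shorter way to copy the tuple with new attributes.

```python
from collections import namedtuple

# Stand-in for xmlschema.ElementData (assumed field order: tag, text, content, attributes)
ElementData = namedtuple("ElementData", "tag text content attributes")

XSI_NIL = "{http://www.w3.org/2001/XMLSchema-instance}nil"

def filter_nil(element_data, *args):
    if not element_data.attributes:
        return element_data
    kept = [item for item in element_data.attributes if item[0] != XSI_NIL]
    # Copy the tuple, changing only the attributes field
    return element_data._replace(attributes=kept)

only_nil = ElementData("nillstring", None, None, [(XSI_NIL, "true")])
nil_plus_id = ElementData("nillstring", None, None, [(XSI_NIL, "true"), ("id", "7")])

filtered_only_nil = filter_nil(only_nil)
filtered_nil_plus_id = filter_nil(nil_plus_id)

print(filtered_only_nil.attributes)
print(filtered_nil_plus_id.attributes)
```

When xsi:nil is the only attribute, nothing is left and the element can decode to None; when other attributes remain, the element still decodes to a dictionary.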

brunato avatar Aug 25 '22 20:08 brunato

@brunato thanks a bunch! I have tested it, and it works like a charm. The performance is also unaffected.

th0ger avatar Aug 26 '22 05:08 th0ger

The performance is also unaffected.

This was the occasion to realize that the emptiness checks on data.text in the converters were unnecessary. Removing them balances out the additional check for whether elem.text is empty/None.

Adding the element_hook check has a small impact on decoding performance, but it gives a definitive option for doing complex customization of the decoded data (if one knows what one is doing ... ;) ).

In any case, another option is to create a custom converter, but this shouldn't be the role of converters, which are mainly for data shaping, not for decoding or filtering the decoded data.

brunato avatar Aug 26 '22 07:08 brunato

If this solution satisfies you, please close the issue.

Changing the default decode behavior may not have a very wide consensus, and can be a harsh decision to take even in a new major release.

Consider that for empty elements a definitive choice between '' and None is uncertain, also for compatibility with decode conventions that are not related to an XSD schema.

Furthermore, nillable elements also retain their attributes, so removing xsi:nil doesn't ensure that the element is decoded to a simple None.

Thank you

brunato avatar Aug 26 '22 07:08 brunato