Deserialization bug: mixed content incorrectly nested in inner element
I originally posted this issue here: https://github.com/tefra/xsdata-pydantic/issues/38, but discovered that I can reproduce the same issue with some minor modifications in the core xsdata package.
This could very well be user error, but I believe I've encountered a deserialization issue where mixed content in an element is being incorrectly parsed into a nested child instead of remaining at the top level.
Minimal XSD example:
This is a sample xsd file to generate xsdata models that reproduce the issue:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="noteCT" mixed="true">
    <xs:sequence>
      <xs:element name="note" type="noteCT" minOccurs="0" maxOccurs="unbounded"/>
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:element name="verse" type="verseCT"/>
      </xs:choice>
    </xs:sequence>
    <xs:attribute name="TEIform" fixed="note"/>
  </xs:complexType>
  <xs:complexType name="milestoneable">
    <xs:attribute name="sID" type="xs:string" use="optional"/>
    <xs:attribute name="eID" type="xs:string" use="optional"/>
  </xs:complexType>
  <xs:complexType name="verseCT" mixed="true">
    <xs:complexContent mixed="true">
      <xs:extension base="milestoneable">
        <xs:choice minOccurs="0" maxOccurs="unbounded">
          <xs:element name="note" type="noteCT"/>
        </xs:choice>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>
</xs:schema>
This is adapted from this schema; verse represents a Bible verse and note represents an inline footnote.
Reproduction steps:
- Run xsdata generate sample.xsd.
- Instantiate and serialize a VerseCt with mixed content and a nested NoteCt.
- Deserialize using XmlParser.from_string().
Code snippet:
Generated dataclass models
from dataclasses import dataclass, field
from typing import ForwardRef, Optional


@dataclass
class Milestoneable:
    class Meta:
        name = "milestoneable"

    s_id: Optional[str] = field(
        default=None,
        metadata={
            "name": "sID",
            "type": "Attribute",
        },
    )
    e_id: Optional[str] = field(
        default=None,
        metadata={
            "name": "eID",
            "type": "Attribute",
        },
    )


@dataclass
class NoteCt:
    class Meta:
        name = "noteCT"

    teiform: str = field(
        init=False,
        default="note",
        metadata={
            "name": "TEIform",
            "type": "Attribute",
        },
    )
    content: list[object] = field(
        default_factory=list,
        metadata={
            "type": "Wildcard",
            "namespace": "##any",
            "mixed": True,
            "choices": (
                {
                    "name": "note",
                    "type": ForwardRef("NoteCt"),
                    "namespace": "",
                },
                {
                    "name": "verse",
                    "type": ForwardRef("VerseCt"),
                    "namespace": "",
                },
            ),
        },
    )


@dataclass
class VerseCt(Milestoneable):
    class Meta:
        name = "verseCT"

    content: list[object] = field(
        default_factory=list,
        metadata={
            "type": "Wildcard",
            "namespace": "##any",
            "mixed": True,
            "choices": (
                {
                    "name": "note",
                    "type": NoteCt,
                    "namespace": "",
                },
            ),
        },
    )
Serialization Snippet
from xsdata.formats.dataclass.serializers.config import SerializerConfig
from xsdata.formats.dataclass.context import XmlContext
from xsdata.formats.dataclass.parsers import XmlParser
from xsdata.formats.dataclass.serializers import XmlSerializer
CONTEXT = XmlContext()
CONFIG = SerializerConfig(indent=" ")
PARSER = XmlParser(context=CONTEXT)
SERIALIZER = XmlSerializer(context=CONTEXT, config=CONFIG)
verse = VerseCt(
    content=[
        "This is before the note.",
        NoteCt(
            content=["This is a note inside the verse."],
        ),
        "This is after the note.",
    ]
)
print(verse)
xml = SERIALIZER.render(verse)
print(xml)
new_verse = PARSER.from_string(xml, VerseCt)
print(new_verse)
Output:
# Printed before serialization
VerseCt(s_id=None, e_id=None, content=['This is before the note.', NoteCt(teiform='note', content=['This is a note inside the verse.']), 'This is after the note.'])
# Serialized XML
<?xml version="1.0" encoding="UTF-8"?>
<verseCT>This is before the note.<note TEIform="note">This is a note inside the verse.</note>This is after the note.</verseCT>
# Parsed object
VerseCt(s_id=None, e_id=None, content=['This is before the note.', NoteCt(teiform='note', content=['This is a note inside the verse.', 'This is after the note.'])])  # Did not expect "This is after the note." to end up in the content of NoteCt!
Problem
When deserialized, the final string "This is after the note." is incorrectly included as part of the NoteCt.content rather than being a sibling of it in VerseCt.content.
Expected
The top-level structure should be preserved:
VerseCt(content=[
    'This is before the note.',
    NoteCt(content=['This is a note inside the verse.']),
    'This is after the note.'
])
Notes
- The XML output appears to be correct and to conform to the schema.
- I have only seen this bug occur during deserialization, which suggests a parsing issue with mixed content and sibling elements.
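As a quick sanity check on the first note, the serialized XML can be inspected with the standard library's xml.etree.ElementTree, independently of xsdata. This is only a sketch: ElementTree stores any text that follows a child element as that child's tail, so if the XML is correct the trailing sentence should appear as the tail of note and not inside the note's own text. The XML string is copied from the output above, without the XML declaration.

```python
import xml.etree.ElementTree as ET

# Serialized XML copied from the output above (XML declaration omitted).
xml = (
    "<verseCT>This is before the note."
    '<note TEIform="note">This is a note inside the verse.</note>'
    "This is after the note.</verseCT>"
)

root = ET.fromstring(xml)
note = root.find("note")

print(repr(root.text))  # text before the first child of <verseCT>
print(repr(note.text))  # text inside <note>
print(repr(note.tail))  # text after </note>, still a child of <verseCT>
```

If the trailing sentence is reported as the tail, the serializer wrote it as a sibling of note inside verseCT, which points at the parsing side.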
Please let me know if this is user error, or if there is any other information I can give to help diagnose the problem.
@tefra This seems to be because tail is appended to the contents while iterating over the Element Nodes with bind_wild_text. Is this intended behavior?
It is a bug, not sure how to resolve it though with nested wildcards.
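To make the tail question concrete, here is a minimal sketch of binding mixed content recursively so that each child's tail is appended to the parent's content list. It is not xsdata's implementation and does not use bind_wild_text or xsdata's parser nodes: the Node class and the bind_mixed function are hypothetical names invented for this example, and the sketch is built on xml.etree.ElementTree only.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a generated mixed-content model."""

    tag: str
    content: list = field(default_factory=list)


def bind_mixed(element: ET.Element) -> Node:
    """Build a Node whose content keeps text and children in document order."""
    node = Node(tag=element.tag)
    if element.text:
        node.content.append(element.text)
    for child in element:
        # The child binds only its own text and its own descendants ...
        node.content.append(bind_mixed(child))
        # ... while its tail is attached to the current (parent) element.
        if child.tail:
            node.content.append(child.tail)
    return node


xml = (
    "<verseCT>This is before the note."
    '<note TEIform="note">This is a note inside the verse.</note>'
    "This is after the note.</verseCT>"
)
verse = bind_mixed(ET.fromstring(xml))
print(verse)
```

In this sketch the same rule applies at every nesting level: the tail is attached by whichever element is iterating over the child, never by the child itself. Whether that maps cleanly onto nested wildcards in xsdata is an open question here, not something the sketch settles.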