Deserialization bug: mixed content incorrectly nested in inner element

Open qthequartermasterman opened this issue 7 months ago • 2 comments

I originally posted this issue here: https://github.com/tefra/xsdata-pydantic/issues/38, but discovered that I can reproduce the same issue with some minor modifications in the core xsdata package.

This could very well be user error, but I believe I've encountered a deserialization issue where mixed content in an element is being incorrectly parsed into a nested child instead of remaining at the top level.

🧪 Minimal XSD example:

This is a sample xsd file to generate xsdata models that reproduce the issue:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:complexType name="noteCT" mixed="true">
        <xs:sequence>
            <xs:element name="note" type="noteCT" minOccurs="0" maxOccurs="unbounded"/>
            <xs:choice minOccurs="0" maxOccurs="unbounded">
                <xs:element name="verse" type="verseCT"/>
            </xs:choice>
        </xs:sequence>
        <xs:attribute name="TEIform" fixed="note"/>
    </xs:complexType>

    <xs:complexType name="milestoneable">
        <xs:attribute name="sID" type="xs:string" use="optional"/>
        <xs:attribute name="eID" type="xs:string" use="optional"/>
    </xs:complexType>

    <xs:complexType name="verseCT" mixed="true">
        <xs:complexContent mixed="true">
            <xs:extension base="milestoneable">
                <xs:choice minOccurs="0" maxOccurs="unbounded">
                    <xs:element name="note" type="noteCT"/>
                </xs:choice>
            </xs:extension>
        </xs:complexContent>
    </xs:complexType>
</xs:schema>

This is adapted from this schema; verse represents a Bible verse and note represents an inline footnote.

🧑‍💻 Reproduction steps:

  1. Run xsdata generate sample.xsd.
  2. Instantiate and serialize a VerseCt with mixed content and a nested NoteCt.
  3. Deserialize using XmlParser.from_string().

🧵 Code snippet:

Generated dataclass models

from dataclasses import dataclass, field
from typing import ForwardRef, Optional


@dataclass
class Milestoneable:
    class Meta:
        name = "milestoneable"

    s_id: Optional[str] = field(
        default=None,
        metadata={
            "name": "sID",
            "type": "Attribute",
        },
    )
    e_id: Optional[str] = field(
        default=None,
        metadata={
            "name": "eID",
            "type": "Attribute",
        },
    )


@dataclass
class NoteCt:
    class Meta:
        name = "noteCT"

    teiform: str = field(
        init=False,
        default="note",
        metadata={
            "name": "TEIform",
            "type": "Attribute",
        },
    )
    content: list[object] = field(
        default_factory=list,
        metadata={
            "type": "Wildcard",
            "namespace": "##any",
            "mixed": True,
            "choices": (
                {
                    "name": "note",
                    "type": ForwardRef("NoteCt"),
                    "namespace": "",
                },
                {
                    "name": "verse",
                    "type": ForwardRef("VerseCt"),
                    "namespace": "",
                },
            ),
        },
    )


@dataclass
class VerseCt(Milestoneable):
    class Meta:
        name = "verseCT"

    content: list[object] = field(
        default_factory=list,
        metadata={
            "type": "Wildcard",
            "namespace": "##any",
            "mixed": True,
            "choices": (
                {
                    "name": "note",
                    "type": NoteCt,
                    "namespace": "",
                },
            ),
        },
    )

Serialization Snippet

from xsdata.formats.dataclass.serializers.config import SerializerConfig
from xsdata.formats.dataclass.context import XmlContext
from xsdata.formats.dataclass.parsers import XmlParser
from xsdata.formats.dataclass.serializers import XmlSerializer

CONTEXT = XmlContext()
CONFIG = SerializerConfig(indent="  ")
PARSER = XmlParser(context=CONTEXT)
SERIALIZER = XmlSerializer(context=CONTEXT, config=CONFIG)


verse = VerseCt(
    content=[
        "This is before the note.",
        NoteCt(
            content=["This is a note inside the verse."],
        ),
        "This is after the note."
    ]
)

print(verse)

xml = SERIALIZER.render(verse)
print(xml)
new_verse = PARSER.from_string(xml, VerseCt)
print(new_verse)

🧾 Output:

# Printed before serialization
VerseCt(s_id=None, e_id=None, content=['This is before the note.', NoteCt(teiform='note', content=['This is a note inside the verse.']), 'This is after the note.'])




# Serialized XML
<?xml version="1.0" encoding="UTF-8"?>
<verseCT>This is before the note.<note TEIform="note">This is a note inside the verse.</note>This is after the note.</verseCT>

# Parsed object
VerseCt(s_id=None, e_id=None, content=['This is before the note.', NoteCt(teiform='note', content=['This is a note inside the verse.', 'This is after the note.'])])  # ❗ Did not expect "This is after the note." to end up in the content of NoteCt!

โŒ Problem

When deserialized, the final string "This is after the note." is incorrectly included as part of the NoteCt.content rather than being a sibling of it in VerseCt.content.

✅ Expected

The top-level structure should be preserved:

VerseCt(content=[
    'This is before the note.',
    NoteCt(content=['This is a note inside the verse.']),
    'This is after the note.'
])
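
To pin down the expected shape independently of xsdata, the serialized XML can be walked with the standard library's xml.etree.ElementTree and the mixed content rebuilt by hand. This is only an illustrative sketch: mixed_content is a hypothetical helper, not part of xsdata, and it represents child elements as (tag, content) tuples instead of the generated dataclasses.

```python
import xml.etree.ElementTree as ET

# The XML produced by the serializer in the snippet above.
XML = (
    "<verseCT>This is before the note."
    '<note TEIform="note">This is a note inside the verse.</note>'
    "This is after the note.</verseCT>"
)


def mixed_content(element):
    """Hypothetical helper: rebuild an element's mixed content as a list."""
    items = []
    if element.text:
        items.append(element.text)
    for child in element:
        items.append((child.tag, mixed_content(child)))
        # Text after a child's closing tag is stored as that child's tail,
        # but it belongs to the parent's content, not the child's.
        if child.tail:
            items.append(child.tail)
    return items


result = mixed_content(ET.fromstring(XML))
print(result)
```

With this walk, the trailing string ends up as the third item of the verse's content, a sibling of the note, which matches the structure shown above.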

📌 Notes

  • The XML output appears to be correct and conforms to the schema.
  • I have only seen this bug occur during deserialization, which suggests a parsing issue with mixed content and sibling elements.

Please let me know if this is user error, or if there is any other information I can give to help diagnose the problem.

qthequartermasterman avatar May 28 '25 21:05 qthequartermasterman

@tefra This seems to be because tail is appended to the contents while iterating over the Element Nodes with bind_wild_text. Is this intended behavior?
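
For context on what "tail" means here: in ElementTree-style trees, text that follows an element's closing tag is stored on that element as .tail, even though it logically belongs to the parent's content. The sketch below uses only the standard library and makes no claim about how bind_wild_text is implemented. The nested document is a made-up example, chosen to show that each tail has to be routed one level up.

```python
import xml.etree.ElementTree as ET

# Made-up nested example: a note inside a note inside a verse.
root = ET.fromstring("<verse>a<note>b<note>c</note>d</note>e</verse>")
outer = root[0]
inner = outer[0]

# Each tail is stored on the element it follows, but logically
# belongs to the content of that element's parent.
tails = {
    "outer note tail (belongs to verse)": outer.tail,
    "inner note tail (belongs to outer note)": inner.tail,
}
for label, value in tails.items():
    print(f"{label}: {value!r}")
```

If a tail is appended to the content of the element it is attached to, instead of to the content of that element's parent, the result is the misplacement described in this issue.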

qthequartermasterman avatar May 28 '25 22:05 qthequartermasterman

@tefra This seems to be because tail is appended to the contents while iterating over the Element Nodes with bind_wild_text. Is this intended behavior?

It is a bug, though I'm not sure how to resolve it with nested wildcards.

tefra avatar Jun 05 '25 02:06 tefra