quick-xml
Merge text and CDATA events in serde deserializer
CDATA sections cannot contain the sequence `]]>`. When that sequence appears in the data, it should be split into two pieces, with each piece placed in its own CDATA section. For example, `]]>` becomes
<![CDATA[]]]>
<![CDATA[]>]]>
or
<![CDATA[]]]]>
<![CDATA[>]]>
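The splitting rule above can be sketched in a few lines of plain Rust. This is only an illustration, not part of quick-xml: the function name `to_cdata_sections` is hypothetical, and it always uses the second variant shown above (keep `]]` in the current section and start the next one with `>`).

```rust
/// Hypothetical helper (not a quick-xml API): wraps `text` in one or more
/// CDATA sections so that `]]>` never appears inside a single section.
fn to_cdata_sections(text: &str) -> String {
    let mut out = String::new();
    let mut rest = text;
    while let Some(pos) = rest.find("]]>") {
        // Keep the `]]` in the current section ...
        out.push_str("<![CDATA[");
        out.push_str(&rest[..pos + 2]);
        out.push_str("]]>");
        // ... and let `>` become the first character of the next section.
        rest = &rest[pos + 2..];
    }
    // Whatever is left contains no `]]>` and fits into one section.
    out.push_str("<![CDATA[");
    out.push_str(rest);
    out.push_str("]]>");
    out
}
```

Any reader that wants the original string back has to concatenate the content of all adjacent sections, which is exactly what this issue asks of the deserializer.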
Currently the serde deserializer processes only one CDATA event at a time, which means that deserializing
<root>
<string><![CDATA[]]]]><![CDATA[>]]></string>
</root>
into
#[derive(serde::Deserialize)]
struct AnyName {
    string: String,
}
would fail or wrongly return `]]` instead of `]]>`.
To fix that, we should merge CDATA events, but there are some ambiguities that should be investigated:
- should we merge CDATA and text events?
  - should `<![CDATA[one]]>two` return `onetwo`?
- should we ignore comments between CDATA events? Between CDATA and text events?
  - should `<![CDATA[one]]><!--comment--><![CDATA[two]]>` return `onetwo`?
  - should `<![CDATA[one]]><!--comment-->two` return `onetwo`?

  Currently all comments are skipped at a very early stage, so the deserializer sees `<![CDATA[one]]><!--comment--><![CDATA[two]]>` as `<![CDATA[one]]><![CDATA[two]]>`
- should we ignore processing instructions between CDATA events? Between CDATA and text events?
  - should `<![CDATA[one]]><?pi?><![CDATA[two]]>` return `onetwo`?
  - should `<![CDATA[one]]><?pi?>two` return `onetwo`?

  Currently all processing instructions are skipped at a very early stage, so the deserializer sees `<![CDATA[one]]><?pi?><![CDATA[two]]>` as `<![CDATA[one]]><![CDATA[two]]>`
- should we ignore whitespace between CDATA sections? Between CDATA and text?
  - should `<![CDATA[one]]> <![CDATA[two]]>` return `onetwo`?
  - should `<![CDATA[one]]> two` return `onetwo`?
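To make the questions above concrete, here is a minimal sketch of one possible answer: merge every run of text and CDATA events into a single text event, and let comments and processing instructions neither contribute content nor break the run. The `Event` enum and `merge_text_events` are simplified stand-ins invented for this example, not quick-xml's real types, and the sketch deliberately keeps all whitespace.

```rust
/// Simplified stand-in for reader events (hypothetical, not quick-xml's `Event`).
#[derive(Debug, Clone, PartialEq)]
enum Event {
    Start(String),
    End(String),
    Text(String),
    CData(String),
    Comment(String),
    PI(String),
}

/// Merges each run of `Text`/`CData` events into one `Text` event.
/// Comments and processing instructions are dropped and do not end a run.
fn merge_text_events(events: Vec<Event>) -> Vec<Event> {
    let mut out = Vec::new();
    // Content of the run that is currently being collected, if any.
    let mut buf: Option<String> = None;
    for event in events {
        match event {
            Event::Text(s) | Event::CData(s) => {
                buf.get_or_insert_with(String::new).push_str(&s);
            }
            Event::Comment(_) | Event::PI(_) => {}
            other => {
                // A tag ends the run: flush the collected text first.
                if let Some(text) = buf.take() {
                    out.push(Event::Text(text));
                }
                out.push(other);
            }
        }
    }
    if let Some(text) = buf.take() {
        out.push(Event::Text(text));
    }
    out
}
```

With this policy the `<string>` element from the example above yields a single text event containing `]]>`, and `<![CDATA[one]]> <![CDATA[two]]>` yields `one two` rather than `onetwo`, because no trimming is applied here.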
I ran some experiments with XmlBeans 5.0.0, a popular Java library for working with XML.
I used the following XSD:
<xs:schema xmlns:this="types.xsd"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
targetNamespace="types.xsd"
elementFormDefault="qualified"
attributeFormDefault="unqualified"
>
<xs:element name="Str" type="xs:string"/>
</xs:schema>
It skips comments and processing instructions and merges text and CDATA sections, as suggested in the issue description. All whitespace is significant (the namespace definition `xmlns="types.xsd"` is omitted for brevity in most examples):
XML | Result of |
---|---|
Unfortunately, this is not an easy task because of the trim feature, which is activated for the serde deserializer. With it enabled, whitespace between a CDATA section and text is trimmed, and that is not easy to fix: to do it correctly, we need to look ahead to an unbounded depth to handle situations like this:
text
<!--comment 1-->
<!--comment 2-->
...
<!--comment N--><![CDATA[cdata section]]>
We should not trim whitespace between text and CDATA, but we should trim it between text and a tag.
Because comments should not change the content of the document, that document is equivalent to:
text
...
<![CDATA[cdata section]]>
(`"text"` + N newlines + `"cdata section"`). Solving #460 first will probably make that easier to implement.
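One way to think about the trimming rule is to postpone it until the whole run of text and CDATA pieces is known, which is the unbounded lookahead mentioned above. The sketch below assumes that comments and processing instructions were already dropped, and that whitespace inside CDATA sections, as well as between text and CDATA, must be kept; only the outer edges that face a tag are trimmed. `Piece` and `merge_and_trim` are hypothetical names used for illustration only.

```rust
/// One piece of character data between two tags (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum Piece {
    Text(String),
    CData(String),
}

/// Concatenates all pieces and trims only the whitespace that faces a tag.
fn merge_and_trim(pieces: &[Piece]) -> String {
    let mut merged = String::new();
    // Byte range of `merged` from the start of the first CDATA piece to the
    // end of the last one; nothing inside this range is ever trimmed.
    let mut protected: Option<(usize, usize)> = None;
    for piece in pieces {
        match piece {
            Piece::Text(s) => merged.push_str(s),
            Piece::CData(s) => {
                let start = merged.len();
                merged.push_str(s);
                let first = protected.map_or(start, |(first, _)| first);
                protected = Some((first, merged.len()));
            }
        }
    }
    match protected {
        // Only text: both edges face a tag, so trim both sides.
        None => merged.trim().to_string(),
        // Trim the text before the first CDATA only at its start and the
        // text after the last CDATA only at its end.
        Some((from, to)) => {
            let head = merged[..from].trim_start();
            let tail = merged[to..].trim_end();
            format!("{}{}{}", head, &merged[from..to], tail)
        }
    }
}
```

For the document above with N = 3, the pieces are `"text\n"`, `"\n"`, `"\n"` and the CDATA section, and the result keeps all three newlines. The cost is that nothing can be emitted until the next tag is seen, so the whole run has to be buffered.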