quick-xml Merge text and CDATA events in serde deserializer

Merge text and CDATA events in serde deserializer

Open Mingun opened this issue 1 year ago • 1 comments

CDATA elements cannot contain sequence ]]>. When that sequence is appeared in the data, it should be split into two pieces and each piece should be put in their own CDATA container:

]]>

become

<![CDATA[]]]>
<![CDATA[]>]]>

<![CDATA[]]]]>
<![CDATA[>]]>

Currently in serde deserializer only one CDATA event processed at time, that means, that deserialization

<root>
  <string><![CDATA[]]]]><![CDATA[>]]></string>
</root>

into

struct AnyName {
  string: String,
}

would fail or wrongly return ]] instead of ]]>.

To fix that we should merge CDATA events, that there are some ambiguities that should be investigated:

should we merge CDATA and text events:
```
<![CDATA[one]]>two
```
should return onetwo?

Judging from that and that, we should do that
should we ignore comments between CDATA events? Between CDATA and text events?
```
<![CDATA[one]]><![CDATA[two]]>
```
should return onetwo?
```
<![CDATA[one]]>two
```
should return onetwo? Currently all comments are skips at very early stage and deserializer sees
```
<![CDATA[one]]><![CDATA[two]]>
```
as
```
<![CDATA[one]]><![CDATA[two]]>
```
should we ignore processing instructions between CDATA events? Between CDATA and text events?
```
<![CDATA[one]]><?pi?><![CDATA[two]]>
```
should return onetwo?
```
<![CDATA[one]]><?pi?>two
```
should return onetwo? Currently all processing instructions are skips at very early stage and deserializer sees
```
<![CDATA[one]]><?pi?><![CDATA[two]]>
```
as
```
<![CDATA[one]]><![CDATA[two]]>
```
should we ignore whitespaces between CDATAs? Between CDATA and text?
```
<![CDATA[one]]>
<![CDATA[two]]>
```
should return onetwo?
```
<![CDATA[one]]>
two
```
should return onetwo?

Sep 09 '22 05:09 Mingun

I made some experiments with XmlBeans 5.0.0 -- a popular Java library to work with XML.

Use the following XSD:

<xs:schema xmlns:this="types.xsd"
           xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="types.xsd"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified"
>
  <xs:element name="Str" type="xs:string"/>
</xs:schema>

It skips comments and processing instructions and merge texts and CDATA sections, as suggested in the issue description. All white spaces are significant (namespace definition xmlns="types.xsd" is omitted for brevity in most examples):

XML	Result of `doc.getStr()`
`<Str xmlns="types.xsd"> text </Str>`	`text`
`<Str><![CDATA[cdata]]></Str>`	`cdata`
`<Str><![CDATA[cdata]]> with text</Str>`	`cdata with text`
`<Str><![CDATA[cdata]]]]><![CDATA[> with cdata]]></Str>`	`cdata]]> with cdata`
`<Str><![CDATA[cdata]]]]><!--comment--><![CDATA[> with comment between cdata]]></Str>`	`cdata]]> with comment between cdata`
`<Str><![CDATA[cdata]]]]><?pi?><![CDATA[> with PI between cdata]]></Str>`	`cdata]]> with PI between cdata`
`<Str><![CDATA[cdata ]]><!--comment-->with comment between text</Str>`	`cdata with comment between text`
`<Str><![CDATA[cdata ]]><?pi?>with PI between text</Str>`	`cdata with PI between text`

Sep 12 '22 05:09 Mingun

Unfortunately, this is not the easy task, because of trim feature, that is activated for serde deserializer. That means, that spaces between CDATA section and text will be trimmed, and it is not easely to fix that, because to do that correctly, we need to lookahead at infinity depth to solve such situations:

text
<!--comment 1-->
<!--comment 2-->
...
<!--comment N--><![CDATA[cdata section]]>

We should not strip between text and CDATA, but should trim between text and tag.

Because comments should not change the content of document, that document is equivalent to:

text


...
<![CDATA[cdata section]]>

("text" + N newlines + "cdata section"). Probably solving #460 first will make that easier to implement.

Nov 05 '22 19:11 Mingun

quick-xml quick-xml copied to clipboard

Merge text and CDATA events in serde deserializer

quick-xml
quick-xml copied to clipboard