quick-xml icon indicating copy to clipboard operation
quick-xml copied to clipboard

Lingering serde whitespace bug

Open jamwil opened this issue 3 months ago • 6 comments

Thanks to the work of @ggodlewski, whitespace is now preserved by default according to the xml specification when deserializing with serde. I note, however, one lingering apparent inconsistency in the case of deserializing a string consisting of only whitespace.

I've isolated the issue in these contrasting passing and failing tests.

Is this intended behaviour?

jamwil avatar Aug 30 '25 19:08 jamwil

The reason is in here: https://github.com/tafia/quick-xml/blob/master/src/de/mod.rs#L2792

Check the implementation of is_blank https://github.com/tafia/quick-xml/blob/master/src/de/mod.rs#L2253

I think it's intended: https://github.com/tafia/quick-xml/blob/master/src/de/mod.rs#L3368

ggodlewski avatar Aug 30 '25 20:08 ggodlewski

Thanks for that context. So if I'm understanding correctly, this is actually due to the xml:space attribute not yet being implemented?

My interpretation of the xml spec (and I could be misreading this) from section 2.10 White Space Handling, is that an xml processor must pass the whitespace through to the application, and the application gets to decide how to handle the xml:space attribute. If quick-xml is a processor, shouldn't the determination of what is significant or insignificant whitespace be done downstream?

Looping in @Mingun since he authored https://github.com/tafia/quick-xml/commit/2b85835968764820d48002dc75789cdd7a0bbe42 which handles this case.

jamwil avatar Aug 31 '25 01:08 jamwil

My interpretation of the xml spec (and I could be misreading this) from section 2.10 White Space Handling, is that an xml processor must pass the whitespace through to the application, and the application gets to decide how to handle the xml:space attribute.

Yes, you're absolutely right. Reader is an "xml processor", but Deserializer is an "application". Currently there are no ways to configure some aspects of it, including handling of whitespace-only text events. Handling of xml:space is one of such tasks (part of #464).

Mingun avatar Aug 31 '25 09:08 Mingun

Actually, this may be fixed if this line will be executed conditionally: https://github.com/tafia/quick-xml/blob/655691c7c394ba6c9dcc3c47725d9ec7ce80f094/src/de/map.rs#L276

One possible solution, replace it with this code (and fix 4 tests in serde-de-seq failed due to this):

// If we have dedicated "$text" field, whitespace-only text may be significant.
// That also means, that type with `$text` fields behaves as if its element has
// `xml:space="preserve"` attribute: you must not have pretty-print indents
// inside this element.
if !self.has_text_field {
    self.skip_whitespaces()?;
}

It seems quite logical change, but I'm not sure, how many things like

<root>
  <field1/>
  some text
  <field2/>
</root>

it will break. You will have to write this document as

<root><field1/>
  some text
  <field2/></root>

to be able to deserialize it with non-vec $text field.

Mingun avatar Aug 31 '25 10:08 Mingun

@Mingun What do you think about the potential solution in #897? It assumes a text event that consists exclusively of whitespace and contains a line break is insignificant. A whitespace event with no line breaks is treated as significant and preserved.

All existing tests pass without modification using this heuristic.

jamwil avatar Aug 31 '25 14:08 jamwil

I've revised #897 to align with your original suggestion.

jamwil avatar Aug 31 '25 20:08 jamwil