Whitespaces around character entities
I maintain a large codebase that used quick-xml for years across hundreds of structs.
After switching from 0.37.5 to 0.38, I hit the already well-known whitespace issues. 0.38.1 introduced Config::trim_text which saved me from touching thousands of fields, but now I found the next caveat: XML character entities make trim_text(true) trim more than needed:
#!/usr/bin/env cargo
---
[dependencies]
serde = { version = "1", features = ["derive"] }
quick-xml = { version = "0.38", features = ["serialize"] }
quick-xml-old = { package = "quick-xml", version = "0.37", features = ["serialize"] }
---
use quick_xml::reader::NsReader;
use serde::Deserialize;
use std::result::Result;
#[derive(Debug, Deserialize)]
struct Root {
bar: String,
}
pub fn from_str_trim_text<'de, T>(s: &'de str) -> Result<T, Box<dyn std::error::Error>>
where
T: Deserialize<'de>,
{
let mut reader = NsReader::from_str(s);
let config = reader.config_mut();
config.trim_text(true); // Trims also spaces around decoded &
config.expand_empty_elements = true;
let mut de = quick_xml::de::Deserializer::borrowing(reader);
Ok(T::deserialize(&mut de)?)
}
fn main() -> Result<(), Box<dyn std::error::Error>> {
let xml = r#"
<root>
<bar>
Foo & Bar & Baz Qux Quux
</bar>
</root>"#;
let old: Root = quick_xml_old::de::from_str(xml)?;
let new: Root = quick_xml::de::from_str(xml)?;
let new_trimming: Root = from_str_trim_text(xml)?;
println!("Old: {}", old.bar); // "Foo & Bar & Baz Qux Quux"
println!("New: {}", new.bar); // "\n Foo & Bar & Baz Qux Quux\n "
println!("New trimming: {}", new_trimming.bar); // "Foo&Bar&Baz Qux Quux"
Ok(())
}
(cargo +nightly -Zscript example.rs)
I know this is documented behavior (thank you), so I'm not calling it a bug. I’m looking for a solution to preserve "internal" spaces (e.g, around the decoded entities) while still trimming leading/trailing whitespace globally.
The only workaround that comes to my mind atm is either introducing a newtype or using deserialize_with, which are both rather error-prone in a mature codebase like mine and require tons of work.
If there is no existing solution right now, I could try to implement a feature (e.g., a new config option?) if the maintainers think it is a good idea.
Actually, using config.trim_text(true); is a footgun, because it trims text too early. I hope, it will be removed sometime in the future. Until that, maybe it is possible to do something with that if find a way to know that trim is not required if "text" is followed by the general entity reference (which is started from &). Anyway, even with such fix trim will work incorrectly when XML comment will be inserted between them. That may open doors to some vulnerabilities when attacker can construct XML with the same logical structure by adding comments, which will influence the final result.
Feel free to investigate that and make a PR.
What if the feature were implemented specifically for the serde Deserializer?
I've played with this idea by introducing DeserializerConfig which can be used like that:
let mut de = quick_xml::de::Deserializer::from_str(s);
de.config_mut().trim_text(true);
T::deserialize(&mut de)
My initial approach was to put this config into XmlReader, but eventually I decided that the top-most "layer" is the best.
I'm not making a PR yet since I first wanted to know what do you think about this idea.
I found, that current way to configure quick-xml a bit worse
let mut r = quick_xml::reader::Reader::from_str(s);
*r.config_mut() = cfg;
// working with r...
I feel that something like this is more traditional approach:
let cfg = quick_xml::reader::Config {
// configure what you want...
Default::default()..
};
// consumes cfg
let mut r = cfg.from_str(s);// r: Reader<...>
// working with r...
It simplifies storing configuration somewhere and create readers based on it. Of course, you may do that in the current design, just by overriding the whole Config struct just after parser creation, as shown above.
For the Deserializer the same approach would be better, as I think.
In any case, it seems that for this case it would be simplifier to make changes in Reader instead of introducing configuration for the deserializer (although such configuration would be useful for #897). Anyway, if you have working prototypes for PR, make it and there we'll see.