quick-xml
Help deserialize mixed tags and string in body $value (html text formatting)
I'm trying to deserialize some dictionary definitions and came across this one, which mixes several tags with plain text (HTML text formatting).
<div style="margin-left:2em"><b>1</b> 〔学業・技術などの能力判定〕 an examination; a test; 《口》 an exam; 《米》 a quiz 《<i>pl</i>. quizzes》.</div>
I looked through the serde-xml-rs tests and tried this solution, which seems close but doesn't quite work:
#[derive(Debug, Deserialize, PartialEq)]
struct DivDefinition {
    style: String,
    #[serde(rename = "$value")]
    definition: Vec<MyEnum>,
}

#[derive(Debug, Deserialize, PartialEq)]
enum MyEnum {
    b(String),
    #[serde(rename = "$value")]
    String,
    i(String),
}
The error I'm getting is:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Custom("unknown variant `〔学業・技術などの能力判定〕 an examination; a test; 《口》 an exam; 《米》 a quiz 《`, expected one of `b`, `$value`, `i`")'
I can make it work for now by dropping MyEnum and using definition: Vec<String> instead, but then I wouldn't know which text is bold and which is italic.
How can I properly deserialize this?
Whoever picks this up, consider starting from https://github.com/tafia/quick-xml/pull/511
Has anybody found a workaround for this? I am having the same issue.
You can close this. Don't know when it was fixed but the original example works now with minor edits:
#[derive(Debug, Deserialize, PartialEq)]
struct DivDefinition {
    #[serde(rename = "@style")]
    style: String,
    #[serde(rename = "$value")]
    definition: Vec<MyEnum>,
}

#[derive(Debug, Deserialize, PartialEq)]
enum MyEnum {
    b(String),
    #[serde(rename = "$text")]
    String,
    i(String),
}
Thoughts on this idea? https://github.com/enricozb/quick-xml/commit/7b4b3f851a50ae9dbb45d54edfdc7c2374ec59d0
Specifically, I'm adding a new special field name $raw that can only deserialize into a String, and just writes all events, until the expected end event, into a string.
It lets you do stuff like this:
const xml: &str = r#"
<who-cares>
<foo property="value">
test
<bar><bii/><int>1</int></bar>
test
<baz/>
</foo>
</who-cares>
"#;
#[derive(Deserialize, Debug)]
struct Root {
    #[serde(rename = "$raw")]
    value: String,
}
let root = quick_xml::de::from_str::<Root>(&xml).unwrap();
println!("parsed: {root:?}");
This prints
parsed: Root { value: "<foo property=\"value\">test<bar><bii></bii><int>1</int></bar>test<baz></baz></foo>" }
One of the problems with this approach is that it doesn't save exactly what was in the XML file. Keeping the input verbatim would be ideal: we could likely avoid any allocations, as serde_json::value::RawValue does, and we would preserve formatting instead of trimming spaces.
Another issue is that empty tags <bii/> get converted to <bii></bii> as that is how the events come in.
My initial idea could possibly be fixed up by temporarily disabling the reader's trimming while raw_string is in use.
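To make the two problems above concrete, here is a minimal std-only sketch of an event-based round trip. The `Event` enum, `tokenize`, and `write_events` are hypothetical names invented for this illustration and are not quick-xml's API; the sketch only assumes, as described above, that a self-closed tag arrives as a start event followed by an end event and that text is trimmed.

```rust
/// Hypothetical, simplified event type (NOT quick-xml's `Event`).
#[derive(Debug, Clone, PartialEq)]
enum Event {
    /// Tag name plus attributes, verbatim, e.g. `foo property="value"`.
    Start(String),
    /// Tag name only.
    End(String),
    /// Text with surrounding whitespace trimmed.
    Text(String),
}

/// Tokenize a well-formed fragment. Assumes no comments, no CDATA,
/// and no `>` inside attribute values.
fn tokenize(xml: &str) -> Vec<Event> {
    let mut events = Vec::new();
    let mut rest = xml;
    while !rest.is_empty() {
        if let Some(after_lt) = rest.strip_prefix('<') {
            let close = after_lt.find('>').expect("unterminated tag");
            let tag = &after_lt[..close];
            rest = &after_lt[close + 1..];
            if let Some(name) = tag.strip_prefix('/') {
                events.push(Event::End(name.trim().to_string()));
            } else if let Some(body) = tag.strip_suffix('/') {
                // A self-closed tag is reported as Start + End,
                // so the `<x/>` spelling is lost at this point.
                let body = body.trim();
                let name = body.split_whitespace().next().unwrap_or("");
                events.push(Event::Start(body.to_string()));
                events.push(Event::End(name.to_string()));
            } else {
                events.push(Event::Start(tag.trim().to_string()));
            }
        } else {
            let end = rest.find('<').unwrap_or(rest.len());
            // Trimming discards the original indentation and newlines.
            let text = rest[..end].trim();
            if !text.is_empty() {
                events.push(Event::Text(text.to_string()));
            }
            rest = &rest[end..];
        }
    }
    events
}

/// Write the events back out as a string.
fn write_events(events: &[Event]) -> String {
    let mut out = String::new();
    for event in events {
        match event {
            Event::Start(tag) => {
                out.push('<');
                out.push_str(tag);
                out.push('>');
            }
            Event::End(name) => {
                out.push_str("</");
                out.push_str(name);
                out.push('>');
            }
            Event::Text(text) => out.push_str(text),
        }
    }
    out
}
```

Feeding the `<foo property="value">` element from the example through `tokenize` and then `write_events` loses both the whitespace around `test` and the `<bii/>` spelling, which is why a string rebuilt from events cannot match the input byte for byte.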
Deserialization of RawValue in serde_json is implemented as deserialization of a newtype with a special name:
https://github.com/serde-rs/json/blob/0131ac68212e8094bd14ee618587d731b4f9a68b/src/de.rs#L1711-L1724
The deserializer then returns data from its own buffer or directly from the input string, depending on which type is being deserialized (Box<RawValue> or &RawValue). We can do the same because we have read_text, but right now only for the borrowing reader. We need to implement #483 first in order to implement the read_text_into needed for the owned reader.
Got it. I saw that private newtype name, but wasn't sure why it mattered. I see now that the json deserializer looks for this tag. I'll take a stab at this.
Additionally, I'm not sure if we should capture the surrounding tags or not. What should this print:
struct AnyName {
    root: RawValue,
}

const xml: &str = "
<root>
<some/><inner/><tags/>
</root>
";

let x: AnyName = from_str(xml)?;
println!("{}", x.root);
Should this print
<root>
<some/><inner/><tags/>
</root>
or
<some/><inner/><tags/>
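
For comparison, here is a small std-only sketch that produces both candidates as borrowed slices of the input, in the spirit of the `&RawValue` case where nothing is copied. `raw_slices` is a hypothetical helper written for this illustration, not part of quick-xml, and it assumes the element has no attributes and does not contain a nested element with the same name.

```rust
/// Hypothetical helper: returns `(outer, inner)` slices of `xml` for the
/// first `<tag>...</tag>` element, or `None` if it is not found.
/// `outer` includes the surrounding tags; `inner` is only the content.
/// Both borrow from `xml`, so nothing is allocated for the result.
fn raw_slices<'a>(xml: &'a str, tag: &str) -> Option<(&'a str, &'a str)> {
    let open = format!("<{tag}>");
    let close = format!("</{tag}>");

    // Byte offset of the opening tag and of the content that follows it.
    let start = xml.find(&open)?;
    let inner_start = start + open.len();

    // First closing tag after the content start.
    let inner_len = xml[inner_start..].find(&close)?;
    let inner_end = inner_start + inner_len;

    let outer = &xml[start..inner_end + close.len()];
    let inner = &xml[inner_start..inner_end];
    Some((outer, inner))
}
```

Note that the byte-exact inner slice for the example above still carries the newlines on either side of `<some/><inner/><tags/>`, so capturing only the content raises the separate question of whether that whitespace should be trimmed.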