quick-xml
quick-xml copied to clipboard
Deserialization of a doctype with very long content fails
quick_xml::de::from_reader()
parsing fails if the XML contains a doctype with content larger than the internal BufRead
capacity.
For instance
<!DOCTYPE [
<!-- A very very long comment *snipped* -->
]>
<X></X>
Here is a minimal code to reproduce this issue. It fails with an ExpectedStart
error.
use std::io::Write;
use serde::Deserialize;
#[derive(Deserialize)]
struct X {}
fn main() {
{
let mut file = std::fs::File::create("test.xml").unwrap();
let header = &"<!DOCTYPE X [<!--";
let footer = &"-->]><X></X>";
let padding = 8192 - (header.len() + 2);
write!(file, "{header}{:1$}{footer}", "", padding).unwrap();
}
let file = std::fs::File::open("test.xml").unwrap();
let reader = std::io::BufReader::new(file);
let _: X = quick_xml::de::from_reader(reader).unwrap();
}
Cargo.toml
content
[package]
name = "test"
version = "0.1.0"
edition = "2021"
[dependencies]
quick-xml = { version = "0.27.1", features = ["serialize"] }
serde = { version = "1.0", features = ["derive"] }
- When decreasing the padding size, or using
BufReader::with_capacity()
to increase the buffer, even of 1 byte, there is no error. - Other
BufRead
implementations don't have this issue (checked with&[u8]
andstdin
). - Content does not have to be in one "block". The same issue occurs for a doctype split into multiple declarations and comments.
- With a longer doctype with real content, the error may be different. For instance it may complains about an invalid
!
from a!ENTITY
tag. - No issue with
serde-xml-rs
, even for larger comments. - Tested on Windows, with rustc 1.66.0.
This even happens if there's anything after the <!DOCTYPE>
line:
- comments
-
<?xml-stylesheet>
Can we fix this? I would be up to do this since from_str
won't work otherwise.
Feel free to submit a PR. Actually, the #590 seems a duplicate of this bug, so you can start from my findings there.
Looks a lot like a trimming problem.
StartTrimmer.trim_start
doesn't get updated like it says "This field is set to true after reading each event except Event::Text and Event::CData". Changing this in a basic way leads to a few merge_text::pi_between
tests failing, since for some reason spaces need to persist, while newlines do not?
Unlikely. StartTrimmer.trim_start
updated here:
https://github.com/tafia/quick-xml/blob/84b07b4e5c9a201b3d3d7fa95c1e8c33ecbe6dd2/src/de/mod.rs#L2900
All whitespaces need to persist until we will sure that we can trim them. Motivation in #520
It isn't only "each event except Event::Text and Event::CData" but also in this catch-all: https://github.com/tafia/quick-xml/blob/84b07b4e5c9a201b3d3d7fa95c1e8c33ecbe6dd2/src/de/mod.rs#L2898 Maybe docs just need an update here.
What I want might just be #561. hmm
My problem might be off-topic, but #603 got the same idea.