quick-xml icon indicating copy to clipboard operation
quick-xml copied to clipboard

Deserialization of a doctype with very long content fails

Open benoitryder opened this issue 1 year ago • 6 comments

quick_xml::de::from_reader() parsing fails if the XML contains a doctype with content larger than the internal BufRead capacity. For instance

<!DOCTYPE [
<!-- A very very long comment *snipped* -->
]>
<X></X>

Here is a minimal code to reproduce this issue. It fails with an ExpectedStart error.

use std::io::Write;
use serde::Deserialize;

#[derive(Deserialize)]
struct X {}

fn main() {
    {
        let mut file = std::fs::File::create("test.xml").unwrap();
        let header = &"<!DOCTYPE X [<!--";
        let footer = &"-->]><X></X>";
        let padding = 8192 - (header.len() + 2);
        write!(file, "{header}{:1$}{footer}", "", padding).unwrap();
    }

    let file = std::fs::File::open("test.xml").unwrap();
    let reader = std::io::BufReader::new(file);
    let _: X = quick_xml::de::from_reader(reader).unwrap();
}

Cargo.toml content

[package]
name = "test"
version = "0.1.0"
edition = "2021"

[dependencies]
quick-xml = { version = "0.27.1", features = ["serialize"] }
serde = { version = "1.0", features = ["derive"] }
  • When decreasing the padding size, or using BufReader::with_capacity() to increase the buffer, even of 1 byte, there is no error.
  • Other BufRead implementations don't have this issue (checked with &[u8] and stdin).
  • Content does not have to be in one "block". The same issue occurs for a doctype split into multiple declarations and comments.
  • With a longer doctype with real content, the error may be different. For instance it may complains about an invalid ! from a !ENTITY tag.
  • No issue with serde-xml-rs, even for larger comments.
  • Tested on Windows, with rustc 1.66.0.

benoitryder avatar Dec 28 '22 20:12 benoitryder

This even happens if there's anything after the <!DOCTYPE> line:

  • comments
  • <?xml-stylesheet>

Can we fix this? I would be up to do this since from_str won't work otherwise.

hecatia-elegua avatar May 19 '23 17:05 hecatia-elegua

Feel free to submit a PR. Actually, the #590 seems a duplicate of this bug, so you can start from my findings there.

Mingun avatar May 19 '23 17:05 Mingun

Looks a lot like a trimming problem. StartTrimmer.trim_start doesn't get updated like it says "This field is set to true after reading each event except Event::Text and Event::CData". Changing this in a basic way leads to a few merge_text::pi_between tests failing, since for some reason spaces need to persist, while newlines do not?

hecatia-elegua avatar May 19 '23 19:05 hecatia-elegua

Unlikely. StartTrimmer.trim_start updated here: https://github.com/tafia/quick-xml/blob/84b07b4e5c9a201b3d3d7fa95c1e8c33ecbe6dd2/src/de/mod.rs#L2900

All whitespaces need to persist until we will sure that we can trim them. Motivation in #520

Mingun avatar May 19 '23 20:05 Mingun

It isn't only "each event except Event::Text and Event::CData" but also in this catch-all: https://github.com/tafia/quick-xml/blob/84b07b4e5c9a201b3d3d7fa95c1e8c33ecbe6dd2/src/de/mod.rs#L2898 Maybe docs just need an update here.

What I want might just be #561. hmm

hecatia-elegua avatar May 19 '23 20:05 hecatia-elegua

My problem might be off-topic, but #603 got the same idea.

hecatia-elegua avatar May 24 '23 19:05 hecatia-elegua