xml-rs icon indicating copy to clipboard operation
xml-rs copied to clipboard

Sequence ']]>' is not escaped in PCDATA

Open FraGag opened this issue 8 years ago • 5 comments

The sequence ]]> must not occur in PCDATA, but xml-rs fails to escape the > in this sequence when emitted with EventWriter in an XmlEvent::Characters. xml-rs correctly gives an error when attempting to parse the erroneous document.

The fault probably lies on xml::escape::escape_str_pcdata, which handles escaping special characters for a PCDATA context. There are two options for resolving this: either systematically escaping > as > or checking for the sequence ]]> specifically and only escape > when it is part of that sequence.

Sample program to reproduce the problem:

extern crate xml;

use xml::reader::EventReader;
use xml::writer::EventWriter;
use xml::writer::events::XmlEvent;

fn main() {
    let mut v = Vec::new();
    {
        let mut ew = EventWriter::new(&mut v);
        ew.write(XmlEvent::start_element("root")).unwrap();
        ew.write(XmlEvent::characters("invalid ]]> invalid")).unwrap();
        ew.write(XmlEvent::end_element()).unwrap();
    }

    let er = EventReader::new(&v[..]);
    for ev in er {
        println!("{:?}", ev);
    }
}

Output:

Ok(StartDocument(1.0, utf-8, None))
Ok(StartElement(root, {"": "", "xml": "http://www.w3.org/XML/1998/namespace", "xmlns": "http://www.w3.org/2000/xmlns/"}))
Err(Error { pos: 1:53, kind: Syntax("Unexpected token: ]]>") })

FraGag avatar Sep 06 '16 01:09 FraGag

EventWriter was made as simple as possible initially, so it does not check for such invalid sequences. Actually, I'm not sure if it is possible to check for them reliably. For example, you can write two character events sequentially, XmlEvent::characters("]]") and XmlEvent::characters(">"). They will result into an invalid XML document, but I doubt that it would be possible to check for them because it writes directly to the output stream, without explicit buffering, and I do not want to introduce such buffering.

netvl avatar Sep 06 '16 10:09 netvl

For example, you can write two character events sequentially, XmlEvent::characters("]]") and XmlEvent::characters(">"). They will result into an invalid XML document, but I doubt that it would be possible to check for them because it writes directly to the output stream, without explicit buffering, and I do not want to introduce such buffering.

A reliable, lightweight way to check for such sequences is to have a state machine that recognizes the ]]> or -- sequences and is plugged just before the final write to the output stream.

whitequark avatar Apr 23 '17 17:04 whitequark

Well, a state machine like this is a kind of a buffer, and it would require that the entire output goes through it, code point by code point. I'm not really sure it is worth the complexity it would introduce.

netvl avatar Apr 24 '17 19:04 netvl

Well, a state machine like this is a kind of a buffer

I don't see what about a state machine is like a buffer at all.

I'm not really sure it is worth the complexity it would introduce.

I don't see the complexity you mention, either. The entire thing is one wrapper struct implementing the Write trait. Besides, there can be no argument on whether it's "worth the complexity" while the straighforward way to use the library causes it to produce invalid XML in cases most people will likely not check for.

whitequark avatar Apr 25 '17 01:04 whitequark

Maybe the docs should be updated to warn that closing cdata characters are not escaped. Currently the cdata function gives the impression that the closing characters are escaped:

https://docs.rs/xml-rs/0.8.3/xml/writer/events/enum.XmlEvent.html#method.cdata

jsoo1 avatar May 31 '20 04:05 jsoo1