xml-rs
xml-rs copied to clipboard
Sequence ']]>' is not escaped in PCDATA
The sequence ]]>
must not occur in PCDATA, but xml-rs fails to escape the >
in this sequence when emitted with EventWriter
in an XmlEvent::Characters
. xml-rs correctly gives an error when attempting to parse the erroneous document.
The fault probably lies on xml::escape::escape_str_pcdata
, which handles escaping special characters for a PCDATA context. There are two options for resolving this: either systematically escaping >
as >
or checking for the sequence ]]>
specifically and only escape >
when it is part of that sequence.
Sample program to reproduce the problem:
extern crate xml;
use xml::reader::EventReader;
use xml::writer::EventWriter;
use xml::writer::events::XmlEvent;
fn main() {
let mut v = Vec::new();
{
let mut ew = EventWriter::new(&mut v);
ew.write(XmlEvent::start_element("root")).unwrap();
ew.write(XmlEvent::characters("invalid ]]> invalid")).unwrap();
ew.write(XmlEvent::end_element()).unwrap();
}
let er = EventReader::new(&v[..]);
for ev in er {
println!("{:?}", ev);
}
}
Output:
Ok(StartDocument(1.0, utf-8, None))
Ok(StartElement(root, {"": "", "xml": "http://www.w3.org/XML/1998/namespace", "xmlns": "http://www.w3.org/2000/xmlns/"}))
Err(Error { pos: 1:53, kind: Syntax("Unexpected token: ]]>") })
EventWriter
was made as simple as possible initially, so it does not check for such invalid sequences. Actually, I'm not sure if it is possible to check for them reliably. For example, you can write two character events sequentially, XmlEvent::characters("]]")
and XmlEvent::characters(">")
. They will result into an invalid XML document, but I doubt that it would be possible to check for them because it writes directly to the output stream, without explicit buffering, and I do not want to introduce such buffering.
For example, you can write two character events sequentially, XmlEvent::characters("]]") and XmlEvent::characters(">"). They will result into an invalid XML document, but I doubt that it would be possible to check for them because it writes directly to the output stream, without explicit buffering, and I do not want to introduce such buffering.
A reliable, lightweight way to check for such sequences is to have a state machine that recognizes the ]]>
or --
sequences and is plugged just before the final write to the output stream.
Well, a state machine like this is a kind of a buffer, and it would require that the entire output goes through it, code point by code point. I'm not really sure it is worth the complexity it would introduce.
Well, a state machine like this is a kind of a buffer
I don't see what about a state machine is like a buffer at all.
I'm not really sure it is worth the complexity it would introduce.
I don't see the complexity you mention, either. The entire thing is one wrapper struct implementing the Write
trait. Besides, there can be no argument on whether it's "worth the complexity" while the straighforward way to use the library causes it to produce invalid XML in cases most people will likely not check for.
Maybe the docs should be updated to warn that closing cdata characters are not escaped. Currently the cdata function gives the impression that the closing characters are escaped:
https://docs.rs/xml-rs/0.8.3/xml/writer/events/enum.XmlEvent.html#method.cdata