xml-rs
xml-rs copied to clipboard
Parsing XML preceded by Unicode BOM triggers an error
When loading an XML file that begins with a byte order mark, xml-rs raises a parsing error.
Found indirectly when using the xmltree
crate
Here is the first few bytes of my XML file (cat ATA6616C.atdf|xxd|head -n2
)
00000000: efbb bf3c 3f78 6d6c 2076 6572 7369 6f6e ...<?xml version
00000010: 3d27 312e 3027 2065 6e63 6f64 696e 673d ='1.0' encoding=
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: MalformedXml(Error { pos: 1:1, kind: Syntax("Unexpected characters outside the root element: \u{feff}") })',
/checkout/src/libcore/result.rs:860
You can see the entire files that reproduce the error in the commit I just referenced above this comment.
Having the same problem
Error: Error { pos: 1:1, kind: Syntax("Unexpected characters outside the root element: \u{feff}") }
Hello ! An user recently reported a bug to me that tracks down to this. I see there was a pull request opened in 2017 that has never been merged. @netvl, would you be ready to merge a PR that would implement the approach you mention in the PR of just adding a thin Read
wrapper that ignores leading byte order marks ?
@lovasoa if you want to create such a wrapper, you don't really need to add it to this library - it would be much easier to publish it as a standalone crate, because it certainly has utility outside of the domain of XML parsing.
As I mentioned elsewhere, in the reimplementation branch this problem will be fixed automatically due to how encoding detection with encoding_rs
works.
Yes, I understand I can work around this bug in my own programs, but I was proposing to work on a patch that will fix it for everyone, since it seems to affect many users. Of course, once you merge your branch adding support for other encodings, you can just ditch this Read wrapper.
If you agree to merge a temporary solution, you can just close this bug today and hopefully never have to reopen it.
work around this bug in my own programs
That's not what I meant. I meant that the wrapper does not really belong to the library, and it certainly won't be enable by default because it would require changing types in the API. Therefore, it would be much more beneficial to the wider community to have this as a separate, independent crate, rather than a part of xml_rs
.
Having it as a separate crate does not have any downsides (except maybe another crate in dependencies, but with Cargo it's not a big problem), but in terms of general reuse it would be much better, for example, for people who don't need XML parsing but do want to remove BOMs for other purposes.
Even when used with this library the code would look almost the same:
// external crate
use bom_remover::BomRemover;
let file = File::open("file.xml").unwrap();
let file = BufReader::new(file):
let file = BomRemover::new(file);
let reader = EventReader::new(file);
// built into xml_rs
use xml::BomRemover;
let file = File::open("file.xml").unwrap();
let file = BufReader::new(file):
let file = BomRemover::new(file);
let reader = EventReader::new(file);
As a bonus point, you won't really need to wait for my review :) And I'm totally all right with adding a link to this new crate somewhere in the docs with a note that if people want to parse XML documents with BOMs, they can use this crate, with some examples.
it would require changing types in the API
Why is that exactly ? Why would we need to expose this wrapper ? It should be completely transparent. The read wrapper will be created and stored as a private field in EventReader
. Is there something that I do not see ?
The problem with having an external crate is that most people who want to parse XML in rust are simply not aware of this bug. They are not actively looking for ways to work around it. Just like me, they write their program, test it, everything works, and some time later, in production, the code encounters an xml string that starts with a BOM, and fails. I really think that this library is the right place to fix it, and I'm ready to help.
Maybe I misunderstood your comment on the original PR ? You were not proposing a wrapper that would be internal to xml-rs ? How would you then fix the bug when parsing XML from a str, which may also start with a BOM character ?
In https://github.com/netvl/xml-rs/pull/213 I propose a from_bytes()
convenience wrapper for this purposes.
let mut xmlcontent=std::io::read_to_string(xmlfile).unwrap().replace("\u{feff}", "");
BOM is now supported.