xml-rs icon indicating copy to clipboard operation
xml-rs copied to clipboard

Parsing XML preceded by Unicode BOM triggers an error

Open dylanmckay opened this issue 6 years ago • 10 comments

When loading an XML file that begins with a byte order mark, xml-rs raises a parsing error.

Found indirectly when using the xmltree crate

Here is the first few bytes of my XML file (cat ATA6616C.atdf|xxd|head -n2)

00000000: efbb bf3c 3f78 6d6c 2076 6572 7369 6f6e  ...<?xml version
00000010: 3d27 312e 3027 2065 6e63 6f64 696e 673d  ='1.0' encoding=
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: MalformedXml(Error { pos: 1:1, kind: Syntax("Unexpected characters outside the root element: \u{feff}") })',
/checkout/src/libcore/result.rs:860

dylanmckay avatar Aug 22 '17 10:08 dylanmckay

You can see the entire files that reproduce the error in the commit I just referenced above this comment.

dylanmckay avatar Aug 22 '17 10:08 dylanmckay

Having the same problem

Error: Error { pos: 1:1, kind: Syntax("Unexpected characters outside the root element: \u{feff}") }

wucke13 avatar Jul 01 '19 08:07 wucke13

Hello ! An user recently reported a bug to me that tracks down to this. I see there was a pull request opened in 2017 that has never been merged. @netvl, would you be ready to merge a PR that would implement the approach you mention in the PR of just adding a thin Read wrapper that ignores leading byte order marks ?

lovasoa avatar May 19 '20 16:05 lovasoa

@lovasoa if you want to create such a wrapper, you don't really need to add it to this library - it would be much easier to publish it as a standalone crate, because it certainly has utility outside of the domain of XML parsing.

As I mentioned elsewhere, in the reimplementation branch this problem will be fixed automatically due to how encoding detection with encoding_rs works.

netvl avatar May 20 '20 07:05 netvl

Yes, I understand I can work around this bug in my own programs, but I was proposing to work on a patch that will fix it for everyone, since it seems to affect many users. Of course, once you merge your branch adding support for other encodings, you can just ditch this Read wrapper.

lovasoa avatar May 20 '20 07:05 lovasoa

If you agree to merge a temporary solution, you can just close this bug today and hopefully never have to reopen it.

lovasoa avatar May 20 '20 07:05 lovasoa

work around this bug in my own programs

That's not what I meant. I meant that the wrapper does not really belong to the library, and it certainly won't be enable by default because it would require changing types in the API. Therefore, it would be much more beneficial to the wider community to have this as a separate, independent crate, rather than a part of xml_rs.

Having it as a separate crate does not have any downsides (except maybe another crate in dependencies, but with Cargo it's not a big problem), but in terms of general reuse it would be much better, for example, for people who don't need XML parsing but do want to remove BOMs for other purposes.

Even when used with this library the code would look almost the same:

// external crate
use bom_remover::BomRemover;

let file = File::open("file.xml").unwrap();
let file = BufReader::new(file):
let file = BomRemover::new(file);
let reader = EventReader::new(file);
// built into xml_rs
use xml::BomRemover;

let file = File::open("file.xml").unwrap();
let file = BufReader::new(file):
let file = BomRemover::new(file);
let reader = EventReader::new(file);

As a bonus point, you won't really need to wait for my review :) And I'm totally all right with adding a link to this new crate somewhere in the docs with a note that if people want to parse XML documents with BOMs, they can use this crate, with some examples.

netvl avatar May 20 '20 07:05 netvl

it would require changing types in the API

Why is that exactly ? Why would we need to expose this wrapper ? It should be completely transparent. The read wrapper will be created and stored as a private field in EventReader. Is there something that I do not see ?

The problem with having an external crate is that most people who want to parse XML in rust are simply not aware of this bug. They are not actively looking for ways to work around it. Just like me, they write their program, test it, everything works, and some time later, in production, the code encounters an xml string that starts with a BOM, and fails. I really think that this library is the right place to fix it, and I'm ready to help.

lovasoa avatar May 20 '20 09:05 lovasoa

Maybe I misunderstood your comment on the original PR ? You were not proposing a wrapper that would be internal to xml-rs ? How would you then fix the bug when parsing XML from a str, which may also start with a BOM character ?

lovasoa avatar May 20 '20 09:05 lovasoa

In https://github.com/netvl/xml-rs/pull/213 I propose a from_bytes() convenience wrapper for this purposes.

tp-m avatar Dec 17 '21 12:12 tp-m

let mut xmlcontent=std::io::read_to_string(xmlfile).unwrap().replace("\u{feff}", "");

arielmendoza avatar Apr 12 '23 11:04 arielmendoza

BOM is now supported.

kornelski avatar May 10 '23 22:05 kornelski