Encoding issue with libxml 2.11.1, 2.11.2, 2.11.3 (OK with libxml 2.11.0)

Open jcamiel opened this issue 2 years ago • 1 comments

Hi,

I've hit a strange encoding issue that started with libxml 2.11.1+ (released a week ago, https://gitlab.gnome.org/GNOME/libxml2/-/tags), using the libxml Rust crate 0.3.2.

My sample:

  • I have the following HTML document: <data>café</data>
  • I evaluate the following XPath expression: normalize-space(//data).

Sample code:

use std::ffi::CStr;
use std::os::raw;
use libxml::parser::{Parser, ParserOptions};
use libxml::xpath::Context;

fn main() {
    let parser = Parser::default_html();
    let options = ParserOptions { encoding: Some("utf-8"), ..Default::default()};
    let data = "<data>café</data>";
    let doc = parser.parse_string_with_options(data, options).unwrap();

    let context = Context::new(&doc).unwrap();
    let result = context.evaluate("normalize-space(//data)").unwrap();

    assert_eq!(unsafe { *result.ptr }.type_, libxml::bindings::xmlXPathObjectType_XPATH_STRING);
    let value = unsafe { *result.ptr }.stringval;
    let value = value as *const raw::c_char;
    let value = unsafe { CStr::from_ptr(value) };
    let value = value.to_string_lossy();
    println!("{value}")
}

With libxml 2.11.0, the value printed is café; with libxml 2.11.1 and later, the value printed is cafÃ©:

  • With libxml 2.11.0:
$ export LIBXML2=/Users/jc/Documents/Dev/libxml/libxml2-2.11.0/lib/libxml2.2.dylib
$ cargo clean && cargo run
$ café
  • With libxml 2.11.3:
$ export LIBXML2=/Users/jc/Documents/Dev/libxml/libxml2-2.11.3/lib/libxml2.2.dylib
$ cargo clean && cargo run
$ cafÃ©

My impression is that the encoding value of ParserOptions is not passed correctly through the crate (note: to reproduce the bug, you have to use Parser::default_html() and not Parser::default()).

To confirm this, I've tested the "equivalent" code in plain C with libxml 2.11.3:

#include <stdio.h>
#include <string.h>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

int main() {
    xmlDocPtr doc = NULL;
    xmlXPathContextPtr context = NULL;
    xmlXPathObjectPtr result = NULL;

    // <data>café</data> in utf-8, with a trailing NUL so that strlen() is well-defined:
    char data[] = {0x3c, 0x64, 0x61, 0x74, 0x61, 0x3e, 0x63, 0x61, 0x66, (char) 0xc3, (char) 0xa9, 0x3c, 0x2f, 0x64,
                   0x61, 0x74, 0x61, 0x3e, 0x00};
    doc = htmlReadMemory(data, (int) strlen(data), NULL, "utf-8",
                         HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);

    // Creating result request
    context = xmlXPathNewContext(doc);
    result = xmlXPathEvalExpression((const unsigned char *) "normalize-space(//data)", context);
    if (result->type == XPATH_STRING) {
        printf("%s\n", result->stringval);
    }

    xmlXPathFreeObject(result);
    xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
    return 0;
}
  • With libxml 2.11.0:
$ gcc -L/Users/jc/Documents/Dev/libxml/libxml2-2.11.0/lib -l xml2 test.c
$ ./a.out
$ café
  • With libxml 2.11.3:
$ gcc -L/Users/jc/Documents/Dev/libxml/libxml2-2.11.3/lib -l xml2 test.c
$ ./a.out
$ café

My suspicion is that the problem lies in https://github.com/KWARC/rust-libxml/blob/a10a5a68a293de992c3724f8cbb4003a5f4fe39c/src/parser.rs#L292

When I debug the following code:

   // Process encoding.
    let encoding_cstring: Option<CString> =
      parser_options.encoding.map(|v| CString::new(v).unwrap());
    let encoding_ptr = match encoding_cstring {
      Some(v) => v.as_ptr(),
      None => DEFAULT_ENCODING,
    };

    // Process url.
    let url_ptr = DEFAULT_URL;

If the parser encoding is initialized with Some("utf-8"), encoding_ptr is no longer valid just before // Process url (it points to a null char). So the call to the htmlReadMemory binding is made with no encoding... The unsafe part of the code is at the limit of my Rust understanding, so I'm unable to tell whether something is wrong here. I hope my issue is clear and, I should have started with this, thank you for your work on this crate!
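
A possible explanation, offered as a sketch rather than a confirmed diagnosis of the crate: in the snippet above, `match encoding_cstring` takes the CString by value, so `v` is dropped at the end of the `Some(v)` arm and the pointer returned by `v.as_ptr()` is left dangling. That would be consistent with the pointer being seen at a null char afterwards. The standalone example below borrows the CString instead, so the pointer stays valid for as long as the owner is alive. The helper name `borrow_encoding_ptr` is hypothetical, and a null pointer stands in for the crate's DEFAULT_ENCODING, whose definition is not shown here.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;
use std::ptr;

/// Hypothetical helper (not part of the crate): returns a pointer into the
/// CString owned by `holder`. The pointer is valid only while `holder` lives.
fn borrow_encoding_ptr(holder: &Option<CString>) -> *const c_char {
    match holder {
        // `v` is a reference here, so the CString is not moved or dropped.
        Some(v) => v.as_ptr(),
        // Assumption: null stands in for the crate's DEFAULT_ENCODING.
        None => ptr::null(),
    }
}

fn main() {
    let encoding: Option<&str> = Some("utf-8");

    // The owner must outlive every use of the raw pointer.
    let encoding_cstring: Option<CString> = encoding.map(|v| CString::new(v).unwrap());
    let encoding_ptr = borrow_encoding_ptr(&encoding_cstring);

    // At this point a C function could safely read the encoding name.
    assert!(!encoding_ptr.is_null());
    let seen = unsafe { CStr::from_ptr(encoding_ptr) };
    println!("{}", seen.to_str().unwrap());
}
```

The only change from the quoted code is matching on a reference instead of on the owned value, which keeps `encoding_cstring` alive until the end of the enclosing scope.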

Regards,

Jc

jcamiel, May 11 '23 15:05

I hit this one as well. I think it is caused by libxml2 changing the default encoding used when NULL is passed from utf-8 to ISO-8859-1, which apparently is more correct. But it's breaking a lot of real-world use cases.

So maybe the encoding override in this crate never worked and nobody noticed since the default was utf-8 anyway?

https://gitlab.gnome.org/GNOME/libxml2/-/issues/570
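
For illustration, here is a minimal sketch of what happens when UTF-8 bytes are decoded as ISO-8859-1, which is the effect the default-encoding change described above would have on the sample document. It uses only the standard library and does not call libxml2; the function name `decode_latin1` is hypothetical.

```rust
/// Hypothetical helper: decodes bytes as ISO-8859-1, where each byte maps
/// directly to the Unicode code point with the same value.
fn decode_latin1(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

fn main() {
    // "é" is encoded in UTF-8 as the two bytes 0xC3 0xA9.
    let utf8_bytes = "café".as_bytes();

    // Read as ISO-8859-1, those two bytes become two separate characters.
    let misread = decode_latin1(utf8_bytes);
    println!("{misread}");
}
```

Each of the two bytes of "é" becomes its own character, so a four-character string comes back as five characters.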

@jcamiel thanks for figuring out a temporary workaround

jangernert, Aug 10 '23 09:08