
_x000D_ kind of value in string cell should be unescaped

Open yorkz1994 opened this issue 1 year ago • 7 comments

Take this Excel value for example; the value is multi-line. After running the code below to print the cell value:

use calamine::{Reader, Xlsx};

fn main() {
    let mut wb: Xlsx<_> = calamine::open_workbook("Book1.xlsx").unwrap();
    let ws = wb.worksheet_range("Sheet1").unwrap();
    let data = ws.get_value((0, 0)).unwrap();
    dbg!(data);
}

Output:

[src/main.rs:7:5] data = String(
    "ABC_x000D_\r\nDEF",        
)

Expected output:

[src/main.rs:7:5] data = String(
    "ABC\r\nDEF",        
)

The Go excelize library handles it correctly. Reference Book1.xlsx

yorkz1994 avatar Sep 24 '24 06:09 yorkz1994

If it helps here is how rust_xlsxwriter encodes these characters in the opposite direction:

https://github.com/jmcnamara/rust_xlsxwriter/blob/main/src/xmlwriter.rs#L204-L248

And here is a test file with each of the characters from 0..127:

https://github.com/jmcnamara/rust_xlsxwriter/blob/main/tests/input/shared_strings01.xlsx

However, as mentioned in the Reference link you need to also handle escaped literal strings which are prefixed by _x005F_. For example a string stored as _x005F_x0000_ in /xl/sharedStrings.xml would be displayed in Excel as _x0000_.

There is a test file for strings like that here:

https://github.com/jmcnamara/rust_xlsxwriter/blob/main/tests/input/shared_strings02.xlsx
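
The decoding rule described above can be sketched in Rust roughly as follows. This is a minimal sketch under stated assumptions, not the code used by calamine, quick_xml, or rust_xlsxwriter: `unescape_excel` and `parse_escape` are hypothetical helper names. It assumes that an escape is exactly `_xHHHH_` with four hex digits (either case), that `_x005F` immediately followed by a valid escape marks that escape as literal text, and that values which are not valid Unicode scalar values (for example lone surrogates) are left untouched.

```rust
/// Hypothetical helper: parse a `_xHHHH_` escape at the start of `bytes`
/// and return its numeric value, or `None` if the pattern does not match.
fn parse_escape(bytes: &[u8]) -> Option<u32> {
    if bytes.len() < 7 || bytes[0] != b'_' || bytes[1] != b'x' || bytes[6] != b'_' {
        return None;
    }
    let mut value = 0u32;
    for &b in &bytes[2..6] {
        // `to_digit(16)` accepts 0-9, a-f and A-F; anything else aborts.
        value = value * 16 + (b as char).to_digit(16)?;
    }
    Some(value)
}

/// Hypothetical helper: decode `_xHHHH_` escapes in a shared string.
fn unescape_excel(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(pos) = rest.find("_x") {
        out.push_str(&rest[..pos]);
        let tail = &rest[pos..];
        let bytes = tail.as_bytes();
        if tail.starts_with("_x005F") && parse_escape(&bytes[6..]).is_some() {
            // Assumption: `_x005F` in front of a valid escape means
            // "keep the following `_xHHHH_` as literal text".
            out.push_str(&tail[6..13]);
            rest = &tail[13..];
        } else if let Some(ch) = parse_escape(bytes).and_then(char::from_u32) {
            // A plain escape: emit the character it encodes.
            out.push(ch);
            rest = &tail[7..];
        } else {
            // Not an escape (or not a valid scalar value): keep the `_`.
            out.push('_');
            rest = &tail[1..];
        }
    }
    out.push_str(rest);
    out
}
```

Under these assumptions, `unescape_excel("A_x000D_B")` yields `"A\rB"`, while `unescape_excel("_x005F_x0000_")` yields the literal text `_x0000_`.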

jmcnamara avatar Sep 25 '24 20:09 jmcnamara

@jmcnamara

Thanks, this information is very useful. I checked the code, and it seems only _x00HH_ literals are escaped. If other valid _xHHHH_ literals are skipped, Excel will no longer treat them as literals when it reads the file. For example, if *_x597D_* is not escaped, Excel shows *好* when the file is read back, but we expect *_x597D_*.
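
To illustrate the writing direction, here is a minimal sketch of escaping every literal `_xHHHH_` sequence rather than only `_x00HH_`. It is not rust_xlsxwriter's actual implementation: `escape_literals` and `is_escape_pattern` are hypothetical names, hex digits are assumed to match in either case, and the sketch only protects literal escape-like text (it does not encode control characters).

```rust
/// Hypothetical helper: true if `bytes` starts with `_xHHHH_`
/// (four hex digits between `_x` and a closing `_`).
fn is_escape_pattern(bytes: &[u8]) -> bool {
    bytes.len() >= 7
        && bytes[0] == b'_'
        && bytes[1] == b'x'
        && bytes[2..6].iter().all(|b| b.is_ascii_hexdigit())
        && bytes[6] == b'_'
}

/// Hypothetical helper: prefix every literal `_xHHHH_` with `_x005F`
/// so that a reader does not decode it as a character.
fn escape_literals(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while let Some(pos) = rest.find("_x") {
        out.push_str(&rest[..pos]);
        let tail = &rest[pos..];
        if is_escape_pattern(tail.as_bytes()) {
            // Looks like an escape: protect it with the `_x005F` prefix.
            out.push_str("_x005F");
            out.push_str(&tail[..7]);
            rest = &tail[7..];
        } else {
            // Ordinary underscore: copy it and keep scanning.
            out.push('_');
            rest = &tail[1..];
        }
    }
    out.push_str(rest);
    out
}
```

With this sketch, the literal text `_x597D_` would be stored as `_x005F_x597D_`, so a reader following the `_x005F` rule would give back `_x597D_` instead of 好.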

yorkz1994 avatar Sep 26 '24 06:09 yorkz1994

If other valid _xHHHH_ literals are skipped, then when doing read, excel will not treat them as literal anymore.

You are correct. That is a bug in rust_xlsxwriter. :-| Update: fixed.

jmcnamara avatar Sep 26 '24 07:09 jmcnamara

I had a look at submitting a patch for this but it looks like the escaping is handled in quick_xml. I then looked at maybe using quick_xml::escape::unescape_with() but that seems intended for entities rather than general unescaping (as far as I can see).

I could look into it a bit more but overall I don't know if it is worth it. The escape _x000D_ == \r is probably the only one that a general user would encounter and maybe they could just handle it themselves. @tafia if you think it is worth fixing let me know and also how/where you think it should be fixed and I can look a bit more.

jmcnamara avatar Oct 28 '24 16:10 jmcnamara

Actually, when calamine reads an Office-generated document there is no _x000D_; when it reads a rust_xlsxwriter-generated document, there is _x000D_.

skydig avatar Mar 13 '25 03:03 skydig

calamine read office generated doc, no x000D,

That is not correct. Here is a file created in Excel that contains _x000D_ and which calamine will read: https://github.com/jmcnamara/rust_xlsxwriter/blob/main/tests/input/shared_strings01.xlsx

There is also an example file in the initial bug report above.

jmcnamara avatar Mar 13 '25 08:03 jmcnamara

Any updates on this? I have a case where I am deserializing using serde, so there is no simple way to sanitize every field.

jmsantosCSW avatar May 12 '25 14:05 jmsantosCSW

I had a look at submitting a patch for this but it looks like the escaping is handled in quick_xml. I then looked at maybe using quick_xml::escape::unescape_with() but that seems intended for entities rather than general unescaping (as far as I can see).

I could look into it a bit more but overall I don't know if it is worth it. The escape _x000D_ == \r is probably the only one that a general user would encounter and maybe they could just handle it themselves. @tafia if you think it is worth fixing let me know and also how/where you think it should be fixed and I can look a bit more.

@tafia there is still a question for you here (when you get a chance) on whether this can/should be handled in calamine or quick_xml.

jmcnamara avatar Jul 04 '25 14:07 jmcnamara

Fixed upstream in v0.31.0.

jmcnamara avatar Sep 27 '25 18:09 jmcnamara