_x000D_-style escapes in string cells should be unescaped
Take this Excel value as an example; the cell value spans multiple lines.
Running the code below prints the cell value:
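use calamine::{Reader, Xlsx};

fn main() {
    // Open the workbook and read the first cell (row 0, column 0) of Sheet1.
    let mut wb: Xlsx<_> = calamine::open_workbook("Book1.xlsx").unwrap();
    let ws = wb.worksheet_range("Sheet1").unwrap();
    let data = ws.get_value((0, 0)).unwrap();
    dbg!(data);
}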
Output:
[src/main.rs:7:5] data = String(
    "ABC_x000D_\r\nDEF",
)
Expected output:
[src/main.rs:7:5] data = String(
    "ABC\r\nDEF",
)
The Golang excelize library handles it correctly. Reference Book1.xlsx
If it helps here is how rust_xlsxwriter encodes these characters in the opposite direction:
https://github.com/jmcnamara/rust_xlsxwriter/blob/main/src/xmlwriter.rs#L204-L248
And here is a test file with each of the characters from 0..127:
https://github.com/jmcnamara/rust_xlsxwriter/blob/main/tests/input/shared_strings01.xlsx
However, as mentioned in the Reference link you need to also handle escaped literal strings which are prefixed by _x005F_. For example a string stored as _x005F_x0000_ in /xl/sharedStrings.xml would be displayed in Excel as _x0000_.
There is a test file for strings like that here:
https://github.com/jmcnamara/rust_xlsxwriter/blob/main/tests/input/shared_strings02.xlsx
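To make the decoding rule above concrete, here is a minimal sketch of a reader-side unescape. The function names `unescape_xlsx_string` and `decode_escape_at` are hypothetical and are not part of calamine or quick_xml. It assumes that a single left-to-right pass is enough: `_x005F_` decodes to a plain underscore, and the `x0000_` that follows is then no longer an escape. Accepting lowercase hex digits and leaving unpaired surrogate values untouched are also assumptions.

```rust
/// Try to decode a `_xHHHH_` escape starting at byte offset `i`.
/// Hypothetical helper, not an API of calamine or quick_xml.
fn decode_escape_at(bytes: &[u8], i: usize) -> Option<char> {
    // An escape is exactly 7 ASCII bytes: '_', 'x', four hex digits, '_'.
    let window = bytes.get(i..i + 7)?;
    if window[0] != b'_' || window[1] != b'x' || window[6] != b'_' {
        return None;
    }
    if !window[2..6].iter().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let hex = std::str::from_utf8(&window[2..6]).ok()?;
    let code = u32::from_str_radix(hex, 16).ok()?;
    // Assumption: surrogate values (not valid chars) are left as literal text.
    char::from_u32(code)
}

/// Decode every `_xHHHH_` escape in `input` in one left-to-right pass.
fn unescape_xlsx_string(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = String::with_capacity(input.len());
    let mut i = 0;
    while i < input.len() {
        if let Some(ch) = decode_escape_at(bytes, i) {
            out.push(ch);
            i += 7; // all 7 bytes were ASCII, so this stays on a char boundary
        } else {
            // Not an escape: copy one whole UTF-8 character through.
            let ch = input[i..].chars().next().unwrap();
            out.push(ch);
            i += ch.len_utf8();
        }
    }
    out
}
```

Because the output is never rescanned, a stored `_x005F_x0000_` comes back as the literal text `_x0000_` rather than a NUL character.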
@jmcnamara
Thanks. This information is very useful. I checked the code, and it seems only _x00HH_ literals are escaped.
If other valid _xHHHH_ literals are skipped when writing, Excel will no longer treat them as literals when reading the file.
For example, take *_x597D_*: if you don't escape it, Excel shows *好* when the file is read back, but we expect *_x597D_*.
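As an illustration of the writer-side rule described here, the sketch below guards every literal `_xHHHH_` pattern, not only `_x00HH_`, by replacing its leading underscore with `_x005F_`. The names `is_escape_pattern` and `escape_xlsx_string` are hypothetical and this is not rust_xlsxwriter's code. The set of control characters it escapes (everything below 0x20 except tab and newline) is an assumption made for the example.

```rust
/// Returns true if `bytes[i..]` starts with a literal `_xHHHH_` pattern.
/// Hypothetical helper, not rust_xlsxwriter's implementation.
fn is_escape_pattern(bytes: &[u8], i: usize) -> bool {
    match bytes.get(i..i + 7) {
        Some(w) => {
            w[0] == b'_'
                && w[1] == b'x'
                && w[6] == b'_'
                && w[2..6].iter().all(|b| b.is_ascii_hexdigit())
        }
        None => false,
    }
}

/// Escape a string for storage in /xl/sharedStrings.xml (sketch only).
fn escape_xlsx_string(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = String::with_capacity(input.len());
    for (i, ch) in input.char_indices() {
        if ch == '_' && is_escape_pattern(bytes, i) {
            // Replace the leading underscore with its own escape so that
            // the rest of the pattern is read back as literal text.
            out.push_str("_x005F_");
        } else if (ch as u32) < 0x20 && ch != '\t' && ch != '\n' {
            // Assumption: control characters other than tab and newline
            // are written as `_xHHHH_`.
            out.push_str(&format!("_x{:04X}_", ch as u32));
        } else {
            out.push(ch);
        }
    }
    out
}
```

With this rule the literal text `_x597D_` is stored as `_x005F_x597D_`, so a reader that decodes `_x005F_` to an underscore gets the original text back instead of 好.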
If other valid _xHHHH_ literals are skipped, then when doing read, excel will not treat them as literal anymore.
You are correct. That is a bug in rust_xlsxwriter. :-| Update: fixed.
I had a look at submitting a patch for this but it looks like the escaping is handled in quick_xml. I then looked at maybe using quick_xml::escape::unescape_with() but that seems intended for entities rather than general unescaping (as far as I can see).
I could look into it a bit more but overall I don't know if it is worth it. The escape _x000D_ == \r is probably the only one that a general user would encounter and maybe they could just handle it themselves. @tafia if you think it is worth fixing let me know and also how/where you think it should be fixed and I can look a bit more.
Actually, when calamine reads an Office-generated document there is no _x000D_, but when it reads a rust_xlsxwriter-generated document there is _x000D_.
calamine read office generated doc, no x000D,
That is not correct. Here is a file created in Excel that contains _x000D_ and which calamine will read: https://github.com/jmcnamara/rust_xlsxwriter/blob/main/tests/input/shared_strings01.xlsx
There is also an example file in the initial bug report above.
Any updates on this? I have a case where I am deserializing using serde, so there is no simple way to sanitize every field.
I had a look at submitting a patch for this but it looks like the escaping is handled in quick_xml. I then looked at maybe using quick_xml::escape::unescape_with() but that seems intended for entities rather than general unescaping (as far as I can see).
I could look into it a bit more but overall I don't know if it is worth it. The escape _x000D_ == \r is probably the only one that a general user would encounter and maybe they could just handle it themselves. @tafia if you think it is worth fixing let me know and also how/where you think it should be fixed and I can look a bit more.
@tafia there is still a question for you here (when you get a chance) on whether this can/should be handled in calamine or quick_xml.
Fixed upstream in v0.31.0.