fastexcel
fastexcel copied to clipboard
29.02 cell cast to 29.020000000000003 string
Steps to reproduce
See the Excel file decimal-numbers.xlsx. It contains a column with two decimals:
| Decimals |
|---|
| 28.14 |
| 29.02 |
Read the Excel file using fastexcel:
read_excel("decimal-numbers.xlsx").load_sheet_by_name("Sheet1").to_polars()
shape: (2, 1)
┌──────────┐
│ Decimals │
│ --- │
│ f64 │
╞══════════╡
│ 28.14 │
│ 29.02 │
└──────────┘
Looks fine.
Then read the Excel file while casting to strings:
read_excel("decimal-numbers.xlsx").load_sheet_by_name("Sheet1", dtypes={0: 'string'}).to_polars()
shape: (2, 1)
┌────────────────────┐
│ Decimals │
│ --- │
│ str │
╞════════════════════╡
│ 28.14 │
│ 29.020000000000003 │
└────────────────────┘
The expected result was that the strings would be the same as shown to the user in the Excel file:
shape: (2, 1)
┌──────────┐
│ Decimals │
│ --- │
│ str │
╞══════════╡
│ 28.14 │
│ 29.02 │
└──────────┘
Wrap-up
I understand this looks like an issue due to floating point precision, and I'm not sure if this:
- could be fixed in fastexcel
- could be fixed in the underlying calamine.
- cannot be fixed at all, since it's a fundamental property of the Excel file format or parsing process.
What's the motivation for filing this bug: In our system, we have highly heterogeneous data, so we have to read all Excel values as strings. However, if users see 29.02 in their Excel files, but 29.020000000000003 in our system, that's highly confusing and surprising to users.
What do you think?
Hello @severinh
Always great to have a clear bug issue with file and clear explanation thank you!
I checked quickly and indeed it feels hard to fix on fastexcel side
In src/data.rs, in create_string_array function we simply do
cell.as_string()
Even with
cell.get_float().map(|v| v.to_string())
or
cell.as_f64()
we get the same behavior with this floating precision issue and AFAIK there is no way for us to get more info on the original input. I'll try to dig on calamine side tonight
EDIT: quick check on calamine
inner: [
SharedString(
"Decimals",
),
Float(
28.14,
),
Float(
29.020000000000003,
),
],
when reading the xml content we get
BytesText { content: Borrowed("29.020000000000003") }
so I guess it goes even further directly in the xml content 😞
The floating point number in the Excel file is 29.020000000000003:
<row r="2" spans="1:1" x14ac:dyDescent="0.2">
<c r="A2" s="1">
<v>28.14</v>
</c>
</row>
<row r="3" spans="1:1" x14ac:dyDescent="0.2">
<c r="A3" s="1">
<v>29.020000000000003</v>
</c>
</row>
For the display (and storage) of floating point numbers Excel uses something very similar to the printf() format option %.16G:
$ python
>>> print("%.16G" % 29.020000000000003)
29.02
That format option isn't supported by Rust in std::fmt however.
Thanks @jmcnamara indeed it's really in the xml content itself.
Maybe we could use rust_decimal crate that implements already a "smart" Display
With
[dependencies]
rust_decimal = { version = "1.36.0", default-features = false }
use rust_decimal::prelude::*;
fn main() {
let x = 29.020000000000003_f64;
println!("x: {}", Decimal::from_f64(x).unwrap());
}
▶ cargo r -q
x: 29.02
WDYT?
For the display (and storage) of floating point numbers Excel uses something very similar to the
printf()format option%.16G:
I was wrong about this for the display part. Excel displays decimal numbers with 8 digits in a standard cell width and 10 digits in a wide cell. Like this:
| Format = 0.000000000 | Unformatted, wide cell | Unformatted, standard width |
|---|---|---|
| 1.123456789 | 1.123456789 | 1.123457 |
| 12.123456789 | 12.12345679 | 12.12346 |
| 123.123456789 | 123.1234568 | 123.1235 |
| 1234.123456789 | 1234.123457 | 1234.123 |
| 12345.123456789 | 12345.12346 | 12345.12 |
| 123456.123456789 | 123456.1235 | 123456.1 |
It also trims trailing zeros and the decimal point if there are no fractional part.
Again thank you for the info. Much appreciated. I reckon we should make a PR on calamine side. I can maybe do that this weekend
Huge thanks for the deep investigation into this issue, @PrettyWood and @jmcnamara! Learned a lot from reading this discussion.
I would trust you to make the right decisions if and where this should be changed. I also hope that this won't have any negative impact on other users of calamine and fastexcel (such as performance).
We chose to support it directly in fastexcel for now. It should have very little impact because it's only applied for float to string conversion