polars
Inconsistent Round behavior
Checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
import polars as pl
df = pl.read_csv('AAPL.csv', has_header=False, try_parse_dates=True, new_columns=['timestamp',"open","high","low","close","volume"],
dtypes={"open": pl.Float32, "high": pl.Float32, "low": pl.Float32, "close": pl.Float32, "volume": pl.Float32})
df
timestamp | open | high | low | close | volume |
---|---|---|---|---|---|
datetime[μs] | f32 | f32 | f32 | f32 | f32 |
2005-01-03 08:00:00 | 0.9979 | 0.9984 | 0.9979 | 0.9984 | 45594.0 |
2005-01-03 08:02:00 | 0.9903 | 0.9903 | 0.9903 | 0.9903 | 354001.0 |
2005-01-03 08:03:00 | 0.9995 | 0.9996 | 0.9995 | 0.9996 | 19540.0 |
2005-01-03 08:04:00 | 1.0003 | 1.0026 | 1.0003 | 1.0026 | 187845.0 |
2005-01-03 08:07:00 | 1.0012 | 1.0012 | 1.001 | 1.001 | 58620.0 |
… | … | … | … | … | … |
2024-04-19 19:40:00 | 164.399994 | 164.399994 | 164.399994 | 164.399994 | 100.0 |
2024-04-19 19:43:00 | 164.430099 | 164.430099 | 164.430099 | 164.430099 | 600.0 |
2024-04-19 19:44:00 | 164.429993 | 164.440002 | 164.429993 | 164.440002 | 383.0 |
2024-04-19 19:47:00 | 164.479996 | 164.479996 | 164.479996 | 164.479996 | 445.0 |
2024-04-19 19:48:00 | 164.479996 | 164.479996 | 164.429993 | 164.449997 | 600.0 |
df.with_columns(
pl.col('open').round(3)
)
timestamp | open | high | low | close | volume |
---|---|---|---|---|---|
datetime[μs] | f32 | f32 | f32 | f32 | f32 |
2005-01-03 08:00:00 | 0.998 | 0.9984 | 0.9979 | 0.9984 | 45594.0 |
2005-01-03 08:02:00 | 0.99 | 0.9903 | 0.9903 | 0.9903 | 354001.0 |
2005-01-03 08:03:00 | 0.999 | 0.9996 | 0.9995 | 0.9996 | 19540.0 |
2005-01-03 08:04:00 | 1.0 | 1.0026 | 1.0003 | 1.0026 | 187845.0 |
2005-01-03 08:07:00 | 1.001 | 1.0012 | 1.001 | 1.001 | 58620.0 |
… | … | … | … | … | … |
2024-04-19 19:40:00 | 164.399994 | 164.399994 | 164.399994 | 164.399994 | 100.0 |
2024-04-19 19:43:00 | 164.429993 | 164.430099 | 164.430099 | 164.430099 | 600.0 |
2024-04-19 19:44:00 | 164.429993 | 164.440002 | 164.429993 | 164.440002 | 383.0 |
2024-04-19 19:47:00 | 164.479996 | 164.479996 | 164.479996 | 164.479996 | 445.0 |
2024-04-19 19:48:00 | 164.479996 | 164.479996 | 164.429993 | 164.449997 | 600.0 |
df.with_columns(
pl.col('open').round(1)
)
timestamp | open | high | low | close | volume |
---|---|---|---|---|---|
datetime[μs] | f32 | f32 | f32 | f32 | f32 |
2005-01-03 08:00:00 | 1.0 | 0.9984 | 0.9979 | 0.9984 | 45594.0 |
2005-01-03 08:02:00 | 1.0 | 0.9903 | 0.9903 | 0.9903 | 354001.0 |
2005-01-03 08:03:00 | 1.0 | 0.9996 | 0.9995 | 0.9996 | 19540.0 |
2005-01-03 08:04:00 | 1.0 | 1.0026 | 1.0003 | 1.0026 | 187845.0 |
2005-01-03 08:07:00 | 1.0 | 1.0012 | 1.001 | 1.001 | 58620.0 |
… | … | … | … | … | … |
2024-04-19 19:40:00 | 164.399994 | 164.399994 | 164.399994 | 164.399994 | 100.0 |
2024-04-19 19:43:00 | 164.399994 | 164.430099 | 164.430099 | 164.430099 | 600.0 |
2024-04-19 19:44:00 | 164.399994 | 164.440002 | 164.429993 | 164.440002 | 383.0 |
2024-04-19 19:47:00 | 164.5 | 164.479996 | 164.479996 | 164.479996 | 445.0 |
2024-04-19 19:48:00 | 164.5 | 164.479996 | 164.429993 | 164.449997 | 600.0 |
df.with_columns(
pl.col('open').round(2)
)
timestamp | open | high | low | close | volume |
---|---|---|---|---|---|
datetime[μs] | f32 | f32 | f32 | f32 | f32 |
2005-01-03 08:00:00 | 1.0 | 0.9984 | 0.9979 | 0.9984 | 45594.0 |
2005-01-03 08:02:00 | 0.99 | 0.9903 | 0.9903 | 0.9903 | 354001.0 |
2005-01-03 08:03:00 | 1.0 | 0.9996 | 0.9995 | 0.9996 | 19540.0 |
2005-01-03 08:04:00 | 1.0 | 1.0026 | 1.0003 | 1.0026 | 187845.0 |
2005-01-03 08:07:00 | 1.0 | 1.0012 | 1.001 | 1.001 | 58620.0 |
… | … | … | … | … | … |
2024-04-19 19:40:00 | 164.399994 | 164.399994 | 164.399994 | 164.399994 | 100.0 |
2024-04-19 19:43:00 | 164.429993 | 164.430099 | 164.430099 | 164.430099 | 600.0 |
2024-04-19 19:44:00 | 164.429993 | 164.440002 | 164.429993 | 164.440002 | 383.0 |
2024-04-19 19:47:00 | 164.479996 | 164.479996 | 164.479996 | 164.479996 | 445.0 |
2024-04-19 19:48:00 | 164.479996 | 164.479996 | 164.429993 | 164.449997 | 600.0 |
Log output
No response
Issue description
I'm using the round function on a column of floats with many decimal places. However, sometimes the round works as expected and sometimes it doesn't.
Expected behavior
The round function should be applied consistently to all rows.
Installed versions
--------Version info---------
Polars: 0.20.18
Index type: UInt32
Platform: Linux-6.5.0-28-generic-x86_64-with-glibc2.35
Python: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
----Optional dependencies----
adbc_driver_manager: <not installed>
cloudpickle: <not installed>
connectorx: <not installed>
deltalake: <not installed>
fastexcel: <not installed>
fsspec: 2024.3.1
gevent: <not installed>
hvplot: <not installed>
matplotlib: 3.8.4
nest_asyncio: 1.6.0
numpy: 1.26.4
openpyxl: <not installed>
pandas: <not installed>
pyarrow: <not installed>
pydantic: <not installed>
pyiceberg: <not installed>
pyxlsb: <not installed>
sqlalchemy: <not installed>
xlsx2csv: <not installed>
xlsxwriter: <not installed>
Hi @ek-ex,
I can reproduce this. The problem, though, is not the round function but rather the display/formatting of f32 data.
pl.Config.set_fmt_str_lengths(100)
DATA = [1.0, 1.2, 1.3, 1.4, 1.5, 100.1, 100.2, 100.3, 100.4, 100.5]
pl.DataFrame(
{"f32": DATA, "f64": DATA},
schema={"f32": pl.Float32, "f64": pl.Float64},
).with_columns(
f32_decimals=pl.col("f32").map_elements(lambda x: f"{x:.20f}", return_dtype=pl.String),
f64_decimals=pl.col("f64").map_elements(lambda x: f"{x:.20f}", return_dtype=pl.String),
)
# shape: (10, 4)
# ┌────────────┬───────┬──────────────────────────┬──────────────────────────┐
# │ f32 ┆ f64 ┆ f32_decimals ┆ f64_decimals │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ f32 ┆ f64 ┆ str ┆ str │
# ╞════════════╪═══════╪══════════════════════════╪══════════════════════════╡
# │ 1.0 ┆ 1.0 ┆ 1.00000000000000000000 ┆ 1.00000000000000000000 │
# │ 1.2 ┆ 1.2 ┆ 1.20000004768371582031 ┆ 1.19999999999999995559 │
# │ 1.3 ┆ 1.3 ┆ 1.29999995231628417969 ┆ 1.30000000000000004441 │
# │ 1.4 ┆ 1.4 ┆ 1.39999997615814208984 ┆ 1.39999999999999991118 │
# │ 1.5 ┆ 1.5 ┆ 1.50000000000000000000 ┆ 1.50000000000000000000 │
# │ 100.099998 ┆ 100.1 ┆ 100.09999847412109375000 ┆ 100.09999999999999431566 │
# │ 100.199997 ┆ 100.2 ┆ 100.19999694824218750000 ┆ 100.20000000000000284217 │
# │ 100.300003 ┆ 100.3 ┆ 100.30000305175781250000 ┆ 100.29999999999999715783 │
# │ 100.400002 ┆ 100.4 ┆ 100.40000152587890625000 ┆ 100.40000000000000568434 │
# │ 100.5 ┆ 100.5 ┆ 100.50000000000000000000 ┆ 100.50000000000000000000 │
# └────────────┴───────┴──────────────────────────┴──────────────────────────┘
💡 Important Concept
- the computer cannot precisely represent most floating-point numbers like 1.2 or 100.3; it can only approximate them
- the f64 type has double the bits, so it can get closer to the real value and often "looks" correct
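The point above can be checked with the standard library alone: `struct` can round-trip a Python float through a 32-bit IEEE-754 representation, which shows the exact f32 value Polars actually stores (164.43 here matches the values in the tables above):

```python
import struct

# Round-trip 164.43 through a 32-bit float; the result is the nearest
# representable f32, which is what an f32 column actually stores:
f32_val = struct.unpack("f", struct.pack("f", 164.43))[0]
print(f"{f32_val:.20f}")   # 164.42999267578125000000
print(f"{f32_val:.6f}")    # 164.429993  <- the "inconsistent" display
print(f"{164.43:.20f}")    # the (different) nearest f64 approximation
```

So `164.429993` and `164.43` are two renderings of the same stored f32 bits, not a wrong rounding result.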
Solution
- I'm not sure what the solution here should be
- the "rounding" is technically correct, but the representation is confusing... 🤔
Decimal
- when you want "perfect precision", the common approach is to use the Decimal type - however, Polars' current Decimal type is still a work in progress (round for Decimal not yet supported #15151)
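For reference, here is a sketch of that approach using Python's standard-library decimal module (rather than the Polars Decimal dtype, since rounding is not yet supported there). Values constructed from strings are stored exactly in base 10, so no binary approximation error appears:

```python
from decimal import Decimal

# Constructed from strings, these values are exact base-10 numbers:
x = Decimal("164.43")
print(x)             # 164.43 (exact, no 164.429993 surprise)
print(round(x, 1))   # 164.4

# Arithmetic stays exact in base 10:
print(Decimal("1.2") + Decimal("0.1"))  # 1.3
```

Note that `Decimal(164.43)` (from a float) would inherit the binary approximation; always construct from strings when exactness matters.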
I am not sure we need to take action on this. A DataFrame's string representation gives you a concise view of the data. If you need more control over how floats are displayed, you can set the float formatting options.
I am not sure if this is the same issue, but the results look very inconsistent when I access them:
df = pl.DataFrame({"index": [1,2,3,4,5]})
df = df.with_columns(progress = pl.col("index") / pl.len())
df.get_column("progress").to_list()
This returns [0.2, 0.4, 0.6000000000000001, 0.8, 1.0]
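That `0.6000000000000001` is ordinary IEEE-754 f64 behavior rather than anything Polars-specific. One hypothetical way such a value arises (an assumption about the cause, not confirmed Polars internals) is multiplying by the reciprocal instead of dividing, which adds an extra rounding step:

```python
# Dividing 3 by 5 directly yields the double closest to 0.6,
# but multiplying by the reciprocal 1/5 (itself inexact) does not:
print(3 / 5)        # 0.6
print(3 * (1 / 5))  # 0.6000000000000001
```

Either way, both values differ from the mathematical 0.6 by less than one ulp; only their shortest-repr strings differ.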