
Inconsistent Round behavior

Open ek-ex opened this issue 9 months ago • 2 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example


import polars as pl

df = pl.read_csv(
    "AAPL.csv",
    has_header=False,
    try_parse_dates=True,
    new_columns=["timestamp", "open", "high", "low", "close", "volume"],
    dtypes={"open": pl.Float32, "high": pl.Float32, "low": pl.Float32, "close": pl.Float32, "volume": pl.Float32},
)
df
┌─────────────────────┬────────────┬────────────┬────────────┬────────────┬──────────┐
│ timestamp           ┆ open       ┆ high       ┆ low        ┆ close      ┆ volume   │
│ ---                 ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---      │
│ datetime[μs]        ┆ f32        ┆ f32        ┆ f32        ┆ f32        ┆ f32      │
╞═════════════════════╪════════════╪════════════╪════════════╪════════════╪══════════╡
│ 2005-01-03 08:00:00 ┆ 0.9979     ┆ 0.9984     ┆ 0.9979     ┆ 0.9984     ┆ 45594.0  │
│ 2005-01-03 08:02:00 ┆ 0.9903     ┆ 0.9903     ┆ 0.9903     ┆ 0.9903     ┆ 354001.0 │
│ 2005-01-03 08:03:00 ┆ 0.9995     ┆ 0.9996     ┆ 0.9995     ┆ 0.9996     ┆ 19540.0  │
│ 2005-01-03 08:04:00 ┆ 1.0003     ┆ 1.0026     ┆ 1.0003     ┆ 1.0026     ┆ 187845.0 │
│ 2005-01-03 08:07:00 ┆ 1.0012     ┆ 1.0012     ┆ 1.001      ┆ 1.001      ┆ 58620.0  │
│ 2024-04-19 19:40:00 ┆ 164.399994 ┆ 164.399994 ┆ 164.399994 ┆ 164.399994 ┆ 100.0    │
│ 2024-04-19 19:43:00 ┆ 164.430099 ┆ 164.430099 ┆ 164.430099 ┆ 164.430099 ┆ 600.0    │
│ 2024-04-19 19:44:00 ┆ 164.429993 ┆ 164.440002 ┆ 164.429993 ┆ 164.440002 ┆ 383.0    │
│ 2024-04-19 19:47:00 ┆ 164.479996 ┆ 164.479996 ┆ 164.479996 ┆ 164.479996 ┆ 445.0    │
│ 2024-04-19 19:48:00 ┆ 164.479996 ┆ 164.479996 ┆ 164.429993 ┆ 164.449997 ┆ 600.0    │
└─────────────────────┴────────────┴────────────┴────────────┴────────────┴──────────┘
df.with_columns(
    pl.col('open').round(3)
)
┌─────────────────────┬────────────┬────────────┬────────────┬────────────┬──────────┐
│ timestamp           ┆ open       ┆ high       ┆ low        ┆ close      ┆ volume   │
│ ---                 ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---      │
│ datetime[μs]        ┆ f32        ┆ f32        ┆ f32        ┆ f32        ┆ f32      │
╞═════════════════════╪════════════╪════════════╪════════════╪════════════╪══════════╡
│ 2005-01-03 08:00:00 ┆ 0.998      ┆ 0.9984     ┆ 0.9979     ┆ 0.9984     ┆ 45594.0  │
│ 2005-01-03 08:02:00 ┆ 0.99       ┆ 0.9903     ┆ 0.9903     ┆ 0.9903     ┆ 354001.0 │
│ 2005-01-03 08:03:00 ┆ 0.999      ┆ 0.9996     ┆ 0.9995     ┆ 0.9996     ┆ 19540.0  │
│ 2005-01-03 08:04:00 ┆ 1.0        ┆ 1.0026     ┆ 1.0003     ┆ 1.0026     ┆ 187845.0 │
│ 2005-01-03 08:07:00 ┆ 1.001      ┆ 1.0012     ┆ 1.001      ┆ 1.001      ┆ 58620.0  │
│ 2024-04-19 19:40:00 ┆ 164.399994 ┆ 164.399994 ┆ 164.399994 ┆ 164.399994 ┆ 100.0    │
│ 2024-04-19 19:43:00 ┆ 164.429993 ┆ 164.430099 ┆ 164.430099 ┆ 164.430099 ┆ 600.0    │
│ 2024-04-19 19:44:00 ┆ 164.429993 ┆ 164.440002 ┆ 164.429993 ┆ 164.440002 ┆ 383.0    │
│ 2024-04-19 19:47:00 ┆ 164.479996 ┆ 164.479996 ┆ 164.479996 ┆ 164.479996 ┆ 445.0    │
│ 2024-04-19 19:48:00 ┆ 164.479996 ┆ 164.479996 ┆ 164.429993 ┆ 164.449997 ┆ 600.0    │
└─────────────────────┴────────────┴────────────┴────────────┴────────────┴──────────┘
df.with_columns(
    pl.col('open').round(1)
)
┌─────────────────────┬────────────┬────────────┬────────────┬────────────┬──────────┐
│ timestamp           ┆ open       ┆ high       ┆ low        ┆ close      ┆ volume   │
│ ---                 ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---      │
│ datetime[μs]        ┆ f32        ┆ f32        ┆ f32        ┆ f32        ┆ f32      │
╞═════════════════════╪════════════╪════════════╪════════════╪════════════╪══════════╡
│ 2005-01-03 08:00:00 ┆ 1.0        ┆ 0.9984     ┆ 0.9979     ┆ 0.9984     ┆ 45594.0  │
│ 2005-01-03 08:02:00 ┆ 1.0        ┆ 0.9903     ┆ 0.9903     ┆ 0.9903     ┆ 354001.0 │
│ 2005-01-03 08:03:00 ┆ 1.0        ┆ 0.9996     ┆ 0.9995     ┆ 0.9996     ┆ 19540.0  │
│ 2005-01-03 08:04:00 ┆ 1.0        ┆ 1.0026     ┆ 1.0003     ┆ 1.0026     ┆ 187845.0 │
│ 2005-01-03 08:07:00 ┆ 1.0        ┆ 1.0012     ┆ 1.001      ┆ 1.001      ┆ 58620.0  │
│ 2024-04-19 19:40:00 ┆ 164.399994 ┆ 164.399994 ┆ 164.399994 ┆ 164.399994 ┆ 100.0    │
│ 2024-04-19 19:43:00 ┆ 164.399994 ┆ 164.430099 ┆ 164.430099 ┆ 164.430099 ┆ 600.0    │
│ 2024-04-19 19:44:00 ┆ 164.399994 ┆ 164.440002 ┆ 164.429993 ┆ 164.440002 ┆ 383.0    │
│ 2024-04-19 19:47:00 ┆ 164.5      ┆ 164.479996 ┆ 164.479996 ┆ 164.479996 ┆ 445.0    │
│ 2024-04-19 19:48:00 ┆ 164.5      ┆ 164.479996 ┆ 164.429993 ┆ 164.449997 ┆ 600.0    │
└─────────────────────┴────────────┴────────────┴────────────┴────────────┴──────────┘

df.with_columns(
    pl.col('open').round(2)
)
┌─────────────────────┬────────────┬────────────┬────────────┬────────────┬──────────┐
│ timestamp           ┆ open       ┆ high       ┆ low        ┆ close      ┆ volume   │
│ ---                 ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---      │
│ datetime[μs]        ┆ f32        ┆ f32        ┆ f32        ┆ f32        ┆ f32      │
╞═════════════════════╪════════════╪════════════╪════════════╪════════════╪══════════╡
│ 2005-01-03 08:00:00 ┆ 1.0        ┆ 0.9984     ┆ 0.9979     ┆ 0.9984     ┆ 45594.0  │
│ 2005-01-03 08:02:00 ┆ 0.99       ┆ 0.9903     ┆ 0.9903     ┆ 0.9903     ┆ 354001.0 │
│ 2005-01-03 08:03:00 ┆ 1.0        ┆ 0.9996     ┆ 0.9995     ┆ 0.9996     ┆ 19540.0  │
│ 2005-01-03 08:04:00 ┆ 1.0        ┆ 1.0026     ┆ 1.0003     ┆ 1.0026     ┆ 187845.0 │
│ 2005-01-03 08:07:00 ┆ 1.0        ┆ 1.0012     ┆ 1.001      ┆ 1.001      ┆ 58620.0  │
│ 2024-04-19 19:40:00 ┆ 164.399994 ┆ 164.399994 ┆ 164.399994 ┆ 164.399994 ┆ 100.0    │
│ 2024-04-19 19:43:00 ┆ 164.429993 ┆ 164.430099 ┆ 164.430099 ┆ 164.430099 ┆ 600.0    │
│ 2024-04-19 19:44:00 ┆ 164.429993 ┆ 164.440002 ┆ 164.429993 ┆ 164.440002 ┆ 383.0    │
│ 2024-04-19 19:47:00 ┆ 164.479996 ┆ 164.479996 ┆ 164.479996 ┆ 164.479996 ┆ 445.0    │
│ 2024-04-19 19:48:00 ┆ 164.479996 ┆ 164.479996 ┆ 164.429993 ┆ 164.449997 ┆ 600.0    │
└─────────────────────┴────────────┴────────────┴────────────┴────────────┴──────────┘

Log output

No response

Issue description

I'm using the round function on a column of floats with many decimals. However, sometimes round works as expected, and sometimes it doesn't.

Expected behavior

The round function should be applied consistently to all rows.

Installed versions

--------Version info---------
Polars:               0.20.18
Index type:           UInt32
Platform:             Linux-6.5.0-28-generic-x86_64-with-glibc2.35
Python:               3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.3.1
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              <not installed>
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

ek-ex · Apr 26 '24 02:04

Hi @ek-ex,

I can reproduce this.

The problem, though, is not the round function but rather the display/formatting of f32 data.

import polars as pl

# Widen string display so the full 20-decimal strings are visible.
pl.Config.set_fmt_str_lengths(100)

DATA = [1.0, 1.2, 1.3, 1.4, 1.5, 100.1, 100.2, 100.3, 100.4, 100.5]

pl.DataFrame(
    {"f32": DATA, "f64": DATA},
    schema={"f32": pl.Float32, "f64": pl.Float64},
).with_columns(
    # Render each value with 20 decimals to expose the stored approximation.
    f32_decimals=pl.col("f32").map_elements(lambda x: f"{x:.20f}", return_dtype=pl.Utf8),
    f64_decimals=pl.col("f64").map_elements(lambda x: f"{x:.20f}", return_dtype=pl.Utf8),
)

# shape: (10, 4)
# ┌────────────┬───────┬──────────────────────────┬──────────────────────────┐
# │ f32        ┆ f64   ┆ f32_decimals             ┆ f64_decimals             │
# │ ---        ┆ ---   ┆ ---                      ┆ ---                      │
# │ f32        ┆ f64   ┆ str                      ┆ str                      │
# ╞════════════╪═══════╪══════════════════════════╪══════════════════════════╡
# │ 1.0        ┆ 1.0   ┆ 1.00000000000000000000   ┆ 1.00000000000000000000   │
# │ 1.2        ┆ 1.2   ┆ 1.20000004768371582031   ┆ 1.19999999999999995559   │
# │ 1.3        ┆ 1.3   ┆ 1.29999995231628417969   ┆ 1.30000000000000004441   │
# │ 1.4        ┆ 1.4   ┆ 1.39999997615814208984   ┆ 1.39999999999999991118   │
# │ 1.5        ┆ 1.5   ┆ 1.50000000000000000000   ┆ 1.50000000000000000000   │
# │ 100.099998 ┆ 100.1 ┆ 100.09999847412109375000 ┆ 100.09999999999999431566 │
# │ 100.199997 ┆ 100.2 ┆ 100.19999694824218750000 ┆ 100.20000000000000284217 │
# │ 100.300003 ┆ 100.3 ┆ 100.30000305175781250000 ┆ 100.29999999999999715783 │
# │ 100.400002 ┆ 100.4 ┆ 100.40000152587890625000 ┆ 100.40000000000000568434 │
# │ 100.5      ┆ 100.5 ┆ 100.50000000000000000000 ┆ 100.50000000000000000000 │
# └────────────┴───────┴──────────────────────────┴──────────────────────────┘

💡 Important Concept

  • a computer cannot represent most decimal numbers such as 1.2 or 100.3 exactly in binary floating point; it can only approximate them (see the sketch below)
  • the f64 type has twice as many bits as f32, so it gets closer to the intended value and often "looks" correct when printed
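
The same effect can be reproduced with nothing but the Python standard library; this sketch is not Polars-specific and just shows which binary32 value actually gets stored:

import struct

# Round-trip 164.48 through the 4-byte IEEE-754 float format ("binary32",
# the same format as Polars' Float32) to see the value that is actually stored.
stored = struct.unpack("f", struct.pack("f", 164.48))[0]
print(f"{stored:.20f}")  # 164.47999572753906250000

Any round() applied afterwards works on this stored approximation, not on the "164.48" that was typed.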

Solution

  • I'm not sure what the right solution is here
  • the "rounding" is technically correct, but the representation is confusing... 🤔

Decimal

  • when you want "perfect precision", the common approach is to use a Decimal type (illustrated below)
  • however, Polars' current Decimal type is still a work in progress (round for Decimal is not yet supported, #15151)
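
As a general illustration of the idea, using Python's standard-library decimal module (not Polars' Decimal dtype):

from decimal import Decimal

# Built from a string, the value is exactly the decimal you wrote down.
print(Decimal("0.1"))  # 0.1

# Built from a float, you get the exact binary value that was stored,
# which only approximates 0.1.
print(Decimal(0.1))    # 0.1000000000000000055511151231257827021181583404541015625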

Julian-J-S · Apr 26 '24 06:04

I am not sure we need to take action on this. A DataFrame's string representation gives you a concise view of the data. If you require more control over how it is visualized, you can set the float formatting options.
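
For example, a minimal sketch of those formatting options (this assumes pl.Config.set_float_precision and pl.Config.set_fmt_float, which recent Polars versions expose):

import polars as pl

# Cap the number of decimals shown in DataFrame output;
# pl.Config.set_fmt_float("full") would instead print every stored digit.
pl.Config.set_float_precision(4)

pl.DataFrame({"open": [164.43, 164.48]}, schema={"open": pl.Float32})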

ritchie46 · Apr 26 '24 08:04

I am not sure if this is the same issue, but this looks very inconsistent when I access the results:

df = pl.DataFrame({"index": [1, 2, 3, 4, 5]})
df = df.with_columns(progress=pl.col("index") / pl.len())
df.get_column("progress").to_list()

This returns [0.2, 0.4, 0.6000000000000001, 0.8, 1.0]
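
For reference, plain Python arithmetic produces the same kind of value, so this comes from standard f64 floating point rather than from Polars itself (a sketch, not necessarily the exact operation Polars performs internally):

print(0.2 * 3)    # 0.6000000000000001
print(0.1 + 0.2)  # 0.30000000000000004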

yusufuyanik1 · May 08 '24 14:05