Inaccurate cum_sum

Open ek-ex opened this issue 9 months ago • 2 comments

Checks

  • [X] I have checked that this issue has not already been reported.
  • [X] I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl

df = pl.read_csv('volume.csv')
[volume.csv](https://github.com/pola-rs/polars/files/15125721/volume.csv)

df = df.with_columns(
    pl.col('volume').cum_sum().over('date').alias('cv'),
)
df.write_csv('cum_sum.csv')
[cum_sum.csv](https://github.com/pola-rs/polars/files/15125718/cum_sum.csv)

Log output

No response

Issue description

After computing the cum_sum of the 'volume' column and validating it manually in Excel, I can see that there are some differences in the computed values.

cum_sum.csv

Expected behavior

cum_sum should be exact.

Installed versions

--------Version info---------
Polars:               0.20.22
Index type:           UInt32
Platform:             Linux-6.5.0-28-generic-x86_64-with-glibc2.35
Python:               3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.3.1
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           3.8.4
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              <not installed>
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

ek-ex, Apr 26 '24 05:04

hi @ek-ex

I cannot reproduce this with your example 🤔

import polars as pl

pl.read_csv("cum_sum.csv").with_columns(
    cum_sum=pl.col("volume").cum_sum(),
    cum_sum_over=pl.col("volume").cum_sum().over("date"),
).filter(
    (pl.col("cum_sum") != pl.col("Manually computed cum_sum"))
    | (pl.col("cum_sum_over") != pl.col("Manually computed cum_sum"))
)
# returns a DataFrame with zero rows: all values are equal

Julian-J-S, Apr 26 '24 06:04

@JulianCologne can you share the output file of your with_columns block?

ek-ex, Apr 27 '24 01:04

I get the same output from NumPy and Polars for the cumulative sum on the volume column.

itamarst, Jun 11 '24 17:06

I can also confirm that Polars' cum_sum is correct on this file.

deanm0000, Jun 11 '24 20:06
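
For anyone who wants to repeat the cross-checks described in this thread without Polars or NumPy, here is a minimal standard-library sketch. It recomputes a per-date running total and compares it with an already computed column. The column names 'date', 'volume' and 'cv' come from the original example; the helper names grouped_cum_sum and find_mismatches are hypothetical, the sample rows are made up (they are not the attached data), and the code assumes 'volume' holds integers.

```python
import csv
import io
from collections import defaultdict


def grouped_cum_sum(rows, key="date", value="volume"):
    """Return the running total of `value`, restarted for each distinct `key`.

    Assumes `value` holds integers, so the arithmetic is exact.
    """
    totals = defaultdict(int)  # one running total per group
    out = []
    for row in rows:
        totals[row[key]] += int(row[value])
        out.append(totals[row[key]])
    return out


def find_mismatches(rows, computed="cv", key="date", value="volume"):
    """Return (row_index, expected, actual) for rows where `computed` differs."""
    expected = grouped_cum_sum(rows, key, value)
    return [
        (i, exp, int(row[computed]))
        for i, (exp, row) in enumerate(zip(expected, rows))
        if exp != int(row[computed])
    ]


# Made-up sample standing in for cum_sum.csv (not the attached file).
sample = io.StringIO(
    "date,volume,cv\n"
    "2024-04-01,10,10\n"
    "2024-04-01,5,15\n"
    "2024-04-02,7,7\n"
    "2024-04-02,3,10\n"
)
rows = list(csv.DictReader(sample))

print(grouped_cum_sum(rows))  # [10, 15, 7, 10]
print(find_mismatches(rows))  # []
```

To check a real export, replace the sample with open('cum_sum.csv', newline='') and pass the rows to find_mismatches. Note that a single running total over the whole column matches the per-date version only when the file contains one date, so a manual spreadsheet check should restart its sum at each new date when comparing against cum_sum().over('date').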