vaex icon indicating copy to clipboard operation
vaex copied to clipboard

Compute cumulative values in a groupby

Open jmehault opened this issue 4 years ago • 4 comments

Hi all,

First of all, thank you for this really cool lib. I discovered it few days ago, and I am starting to understand its philosophy.

I face a problem, I guess someone would give me some insight. I am trying to compute the cumulative sum of a variable based on a group. The straight solution df['cum_sum'] = df.groupby("class", agg={'x': 'cumsum'}) does not work because cumsum() is not supported in vaex. The workaround I use is to move to pandas, compute what I want and come back to vaex. Here is an example of my current solution :

import pandas as pd
df = pd.DataFrame({"class":np.repeat(["a", "b", "c"], 5), "x":np.arange(15)})
gr = df.groupby("class")
df["cum_x_by_class"] = gr.x.cumsum()

Do you think a solution with the current state of vaex is possible ? Thanks a lot

jmehault avatar Jan 02 '20 15:01 jmehault

Hi Jérémie,

thanks! I think we can do it, but it requires an in-memory copy of the data, and it also requires respecting the data order, so the performance could be quite bad. It may also require 2 passes over the data, and another one of the final result. Together with @JovanVeljanoski we are thinking about new aggregates, and it's great you've come up with this example, we'll take it into consideration.

Would it help by doing an approximation? You could do a binby, to do a 2d histogram (with sum aggregation), and do a sum in the final direction? sth like (didn't test it yet...):

# assume class is a categorical/ordinal
X = df.binby([df['class'], df['x']], agg={'x': 'sum'}, limits=[None, [0, 100]], shape=[None, 1000])
c = np.cumsum(X, axis=1)

cheers,

Maarten

maartenbreddels avatar Jan 06 '20 19:01 maartenbreddels

Thanks a lot for your answer Maarten, I see I have a lot of work to know every existing vaex's functions. In vaex-core 1.4.0 and on master, binby() does not support limits and shape parameters. Do I missed something ? Without those attributes, the final result 'c' is: <xarray.DataArray (statistic: 1, class_: 3, x: 15)> array([[[ 0., 1., 3., 6., 10., 10., 10., 10., 10., 10., 10., 10., 10., 10., 10.], [ 0., 1., 3., 6., 10., 15., 21., 28., 36., 45., 45., 45., 45., 45., 45.], [ 0., 1., 3., 6., 10., 15., 21., 28., 36., 45., 55., 66., 78., 91., 105.]]]) Coordinates:

  • statistic (statistic) <U1 'x'
  • class_ (class_) <U1 'a' 'b' 'c'
  • x (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 This might help, but still with a lot of handling to get the expected result. I am going to script the required transformations and see if it is accpetable in term of time and simplicity. I am using the result as a physical measurement and it is required to be precise.

Best regards, Jérémie

jmehault avatar Jan 07 '20 09:01 jmehault

Hi

any news on having cumsum implemented in vaex? I think this is really relevant when you consider time in your analysis.

best

Francesco

hyperfra avatar Jun 29 '20 22:06 hyperfra

Do we have cumsum/cumsum support in groups now in this wonderful library ?

meta-ks avatar Oct 31 '23 07:10 meta-ks