vaex
vaex copied to clipboard
Compute cumulative values in a groupby
Hi all,
First of all, thank you for this really cool lib. I discovered it few days ago, and I am starting to understand its philosophy.
I face a problem, I guess someone would give me some insight.
I am trying to compute the cumulative sum of a variable based on a group.
The straight solution
df['cum_sum'] = df.groupby("class", agg={'x': 'cumsum'})
does not work because cumsum() is not supported in vaex.
The workaround I use is to move to pandas, compute what I want and come back to vaex. Here is an example of my current solution :
import pandas as pd
df = pd.DataFrame({"class":np.repeat(["a", "b", "c"], 5), "x":np.arange(15)})
gr = df.groupby("class")
df["cum_x_by_class"] = gr.x.cumsum()
Do you think a solution with the current state of vaex is possible ? Thanks a lot
Hi Jérémie,
thanks! I think we can do it, but it requires an in-memory copy of the data, and it also requires respecting the data order, so the performance could be quite bad. It may also require 2 passes over the data, and another one of the final result. Together with @JovanVeljanoski we are thinking about new aggregates, and it's great you've come up with this example, we'll take it into consideration.
Would it help by doing an approximation? You could do a binby, to do a 2d histogram (with sum aggregation), and do a sum in the final direction? sth like (didn't test it yet...):
# assume class is a categorical/ordinal
X = df.binby([df['class'], df['x']], agg={'x': 'sum'}, limits=[None, [0, 100]], shape=[None, 1000])
c = np.cumsum(X, axis=1)
cheers,
Maarten
Thanks a lot for your answer Maarten, I see I have a lot of work to know every existing vaex's functions. In vaex-core 1.4.0 and on master, binby() does not support limits and shape parameters. Do I missed something ? Without those attributes, the final result 'c' is: <xarray.DataArray (statistic: 1, class_: 3, x: 15)> array([[[ 0., 1., 3., 6., 10., 10., 10., 10., 10., 10., 10., 10., 10., 10., 10.], [ 0., 1., 3., 6., 10., 15., 21., 28., 36., 45., 45., 45., 45., 45., 45.], [ 0., 1., 3., 6., 10., 15., 21., 28., 36., 45., 55., 66., 78., 91., 105.]]]) Coordinates:
- statistic (statistic) <U1 'x'
- class_ (class_) <U1 'a' 'b' 'c'
- x (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 This might help, but still with a lot of handling to get the expected result. I am going to script the required transformations and see if it is accpetable in term of time and simplicity. I am using the result as a physical measurement and it is required to be precise.
Best regards, Jérémie
Hi
any news on having cumsum implemented in vaex? I think this is really relevant when you consider time in your analysis.
best
Francesco
Do we have cumsum/cumsum support in groups now in this wonderful library ?