
[FEATURE-REQUEST] Like Pandas cut

Open heyuqi1970 opened this issue 3 years ago • 3 comments

Description: I want to group a numeric column by value intervals, similar to what the pandas cut function does. In pandas I can use cut to create an interval label column and then group by the new column:

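import pandas as pd

# `data` is an existing DataFrame; `label` is a list of six names, one per interval
bins = [-150, -110, -100, -90, -80, -70, -30]
data["rsrp_range"] = pd.cut(data["OptimalAvgRSRP"], bins=bins, labels=label, right=True)
pdf = data.groupby(data["rsrp_range"]).agg({"rsrp_range": "count"})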

Does Vaex have a similar function?

heyuqi1970 · Feb 25 '21 01:02

Hi,

Good question.

We don't have cut implemented, but we do wrap numpy.searchsorted: https://numpy.org/doc/stable/reference/generated/numpy.searchsorted.html

import numpy as np
import vaex

df = vaex.from_arrays(x=vaex.vrange(0, 10))
bins = np.array([0, 3, 10])  # make sure to create a numpy array from this
# searchsorted returns, for each x, the index of the bin edge it falls at or before
df['x_bin'] = df.func.searchsorted(bins, df.x)
df['x_name'] = df.x_bin.map({0: 'small', 1: 'medium', 2: 'large'})
df
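
To make the indices in the snippet above easier to predict, here is a minimal NumPy-only sketch (no vaex required) of what searchsorted returns for the same bins and values. It assumes the vaex wrapper behaves like plain np.searchsorted with the default side='left', and it also shows how side='right' moves values that sit exactly on an edge:

```python
import numpy as np

bins = np.array([0, 3, 10])
x = np.arange(10)  # same values as vaex.vrange(0, 10)

# side='left' (the default): index of the first edge that is >= x
left = np.searchsorted(bins, x, side="left")

# side='right': index of the first edge that is > x
right = np.searchsorted(bins, x, side="right")

print(left.tolist())   # [0, 1, 1, 1, 2, 2, 2, 2, 2, 2]
print(right.tolist())  # [1, 1, 1, 2, 2, 2, 2, 2, 2, 2]
```

With side='left', only x = 0 gets index 0, so the 'small' label covers just the first edge itself, and x = 3 lands in the same bin as 1 and 2.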


@JovanVeljanoski should we implement cut using this? Or should we have this in the docs somewhere? @heyuqi1970 or would you like to give this a try?

Regards,

Maarten

maartenbreddels · Apr 01 '21 09:04

Thanks for your reply, I will try this.

heyuqi1970 · Apr 06 '21 02:04

This is what I wrote and used in my project:

import numpy as np

def custom_cut(dfv, col, bins, labels=None, right=True):
    # Sort the unique bin edges
    sorted_bins = np.sort(np.unique(bins))

    # Use searchsorted to find the bin index for each element of the column.
    # Note: right=True selects side='right', so a value equal to an edge
    # falls into the higher bin; right=False puts it into the lower bin.
    bin_indices = dfv.func.searchsorted(sorted_bins, dfv[col], side='right' if right else 'left')

    # Adjust the bin indices to handle out-of-bounds cases
    bin_indices = bin_indices.clip(0, len(sorted_bins) - 1)

    # Apply the labels if provided
    if labels is not None:
        result = bin_indices.map(dict(zip(range(len(labels)), labels)))
    else:
        result = bin_indices

    return result

and

custom_cut(dfv, 'x', bins, labels=labels, right=False) gives:

0   small
1  medium
2  medium
3  medium
4   large
5   large
6   large
7   large
8   large
9   large
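
One caveat if the goal is to match pd.cut closely: the function above keeps the raw searchsorted indices, so index 0 collects values at or below the first edge, and its right flag maps to side in the opposite direction from pd.cut's right (where right=True means intervals closed on the right). A possible variant is sketched below in plain NumPy. The helper name cut_indices and the use of -1 for out-of-range values are assumptions made for illustration; neither is part of vaex or pandas.

```python
import numpy as np

def cut_indices(values, bins, right=True):
    """Return 0-based interval indices, or -1 for values outside all intervals.

    Hypothetical helper for illustration only.
    """
    edges = np.sort(np.unique(bins))
    # right=True  -> intervals (a, b], which corresponds to side='left'
    # right=False -> intervals [a, b), which corresponds to side='right'
    side = "left" if right else "right"
    idx = np.searchsorted(edges, np.asarray(values), side=side) - 1
    # Values below the first interval are already -1; send values past the
    # last edge to -1 as well, since there are only len(edges) - 1 intervals.
    return np.where(idx >= len(edges) - 1, -1, idx)

bins = [0, 3, 10]
labels = ["small", "large"]  # one label per interval, so len(bins) - 1 labels
x = np.arange(12)

idx_right = cut_indices(x, bins, right=True)   # intervals (0, 3], (3, 10]
idx_left = cut_indices(x, bins, right=False)   # intervals [0, 3), [3, 10)

# Map indices to labels, leaving out-of-range values as None
names = [labels[i] if i >= 0 else None for i in idx_right]
print(idx_right.tolist())  # [-1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, -1]
print(idx_left.tolist())   # [0, 0, 0, 1, 1, 1, 1, 1, 1, 1, -1, -1]
```

The same index arithmetic should carry over to df.func.searchsorted in vaex, assuming it forwards the side argument to NumPy as the function above relies on.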

msat59 · Aug 22 '23 22:08