
[QST] aggregate function that operates on vector(array of numeric) data

Open Rhett-Ying opened this issue 1 year ago • 3 comments

What is your question? I am wondering if cudf has native or built-in support for aggregate functions that run against vector data. Specifically, text/image embeddings are stored in a column of a CSV/Parquet file, and I'd like to run various aggregate functions such as mean, max, and so on. All these operations are element-wise: for example, taking the mean of all values at the same index across rows, producing an array of the same length. What's more, I'd like to run K-Nearest-Neighbor search as well.

If not natively supported, how can I achieve these operations efficiently?

example code:

import cudf
import numpy as np
import pandas as pd

# Sample DataFrame with Pandas to cuDF conversion
data = {
    'category': ['A', 'A', 'B', 'B'],
    'values': [np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8, 9]), np.array([10, 11, 12])]
}
pdf = pd.DataFrame(data)
df = cudf.DataFrame.from_pandas(pdf)

result = df.groupby('category').agg({'values': ['sum', 'mean']})

print(result)

# Expected output
'''
category
A     [2.5, 3.5, 4.5]
B    [8.5, 9.5, 10.5]
Name: values, dtype: object
'''

Rhett-Ying avatar May 14 '24 03:05 Rhett-Ying

This kind of operation is not natively supported, unfortunately. The fundamental issue is that pandas allows you to put arbitrary objects into a Series/DataFrame and will run Python operations on them. In this case, since you put numpy arrays in, pandas happily leaves them as numpy arrays and uses numpy's binary operations on them, so this works as expected. cudf does not support arbitrary objects in this way, so we have to be a bit more clever about rearranging the data ourselves to handle this kind of operation. Per-row array data is supported through the list dtype, which is what you're getting from the `from_pandas` call in your snippet. To work with that in a vectorized fashion, the typical approach is to use the `explode` method, which flattens out the data. Here is a snippet that gives you an essentially equivalent result (slight differences in column names etc.):

import cudf
import numpy as np
import pandas as pd

# Sample DataFrame with Pandas to cuDF conversion
data = {
    'category': ['A', 'A', 'B', 'B'],
    'values': [np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8, 9]), np.array([10, 11, 12])]
}
pdf = pd.DataFrame(data)
df = cudf.DataFrame.from_pandas(pdf)

print("pandas result")
print(pdf.groupby('category').agg({'values': ['sum', 'mean']}))
print()

# Flatten the lists: one row per array element, with the original row index repeated
exploded_values = df[["values"]].explode("values")
# Rejoin the category labels via the shared (now repeated) index
df = df[["category"]].merge(exploded_values, left_index=True, right_index=True)
# Tag each element with its position within its original array (3 elements, 4 rows)
df["index"] = np.tile(np.arange(3), 4)

print("cudf result")
print(df.groupby(["category", "index"]).agg({"values": ["sum", "mean"]}).groupby("category").collect())

This outputs:

pandas result
                values                  
                   sum              mean
category                                
A            [5, 7, 9]   [2.5, 3.5, 4.5]
B         [17, 19, 21]  [8.5, 9.5, 10.5]

cudf result
         (values, sum)    (values, mean)
category                                
A            [5, 9, 7]   [2.5, 4.5, 3.5]
B         [19, 17, 21]  [9.5, 8.5, 10.5]

vyasr avatar May 20 '24 23:05 vyasr

@vyasr Thanks for your suggestion. Is the approach you gave above equivalent to splitting the array into separate columns, applying sum()/mean() on each column, and then merging the output back into an array?

Rhett-Ying avatar May 22 '24 03:05 Rhett-Ying

Yes, that is basically equivalent. You cannot operate on the numpy arrays directly, but assuming they are all of the same length you could split them into multiple columns if you have control over construction. Otherwise, the list-based approach I showed is the way to process the numpy array-based inputs from pandas as-is.
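For fixed-length vectors, the column-splitting approach could look something like the sketch below (written with pandas for illustration; cudf mirrors this part of the pandas API, so the same code should run on GPU with `cudf` in place of `pandas`; the column names `v0`..`v2` are made up for this example):

```python
import numpy as np
import pandas as pd

# Fixed-length (3-element) vectors, stored as one numeric column per component
vectors = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]], dtype="float64")
df = pd.DataFrame({"category": ["A", "A", "B", "B"]})
for i in range(vectors.shape[1]):
    df[f"v{i}"] = vectors[:, i]

# Element-wise sum/mean is now an ordinary per-column groupby aggregation
agg = df.groupby("category").agg(["sum", "mean"])
print(agg)

# Reassemble the per-component means back into arrays if the list form is needed
mean_arrays = {cat: agg.loc[cat, (slice(None), "mean")].to_numpy() for cat in agg.index}
print(mean_arrays)
```

With this layout every aggregation stays fully vectorized per column, at the cost of fixing the vector length at construction time.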

vyasr avatar May 23 '24 03:05 vyasr

@Rhett-Ying does the above solution address your needs?

vyasr avatar May 30 '24 00:05 vyasr

@vyasr Thanks for your suggestion. One major concern for me is performance, especially when I want to apply more advanced operations on vector data such as K-Nearest-Neighbor search. Should I leverage tools like cuVS for operations on vector data?

Rhett-Ying avatar May 30 '24 00:05 Rhett-Ying

For more advanced operations, yes, it will depend. If the operators already exist in other libraries like cuVS, those will almost certainly be faster than any apply-based solution you come up with in just cudf. In general, if you are trying to do vectorized operations on homogeneous vectors (i.e. something that would fit in a matrix or a higher-order tensor, not needing a ragged list), you will likely have better luck implementing those types of operations performantly in cupy. The same is true on the host: you would probably get better performance from numpy operations than pandas operations for something like a manual kNN implementation, since numpy devolves directly to its vectorized operations (implemented in C) whereas pandas introduces some extra layers of Python.
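To illustrate the vectorized approach, here is a minimal brute-force kNN sketch in numpy (the same pattern maps to cupy by swapping the import, though a tuned library like cuVS would be the better choice at scale; the `knn` helper and its sample data are made up for this example):

```python
import numpy as np

def knn(queries, database, k):
    """Brute-force k-nearest-neighbor search under squared Euclidean distance.

    queries:  (n_queries, dim) array
    database: (n_points, dim) array
    Returns a (n_queries, k) array of indices of the nearest database rows.
    """
    # Pairwise squared distances via broadcasting:
    # ||q - d||^2 = ||q||^2 - 2 q.d + ||d||^2
    dists = (
        (queries**2).sum(axis=1, keepdims=True)
        - 2 * queries @ database.T
        + (database**2).sum(axis=1)
    )
    # argpartition finds the k smallest per row in O(n); then sort just those k
    idx = np.argpartition(dists, k - 1, axis=1)[:, :k]
    row = np.arange(queries.shape[0])[:, None]
    order = np.argsort(dists[row, idx], axis=1)
    return np.take_along_axis(idx, order, axis=1)

database = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.2, 0.1]])
neighbors = knn(np.array([[0.0, 0.0]]), database, k=2)
print(neighbors)  # indices of the 2 closest database rows per query
```

Every step here is a single vectorized numpy call, which is exactly the kind of structure that ports cleanly to cupy for GPU execution.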

vyasr avatar Jun 03 '24 16:06 vyasr

Going to close as resolved, but feel free to follow up if there are more questions.

vyasr avatar Jun 24 '24 18:06 vyasr