cudf [PERF] looping through dataframe is 100x slower than when running without cudf

Describe the bug

I have a case where I loop through each element in a dataframe and call a function for each element. When running with cudf.pandas, this takes on the order of 100x longer than when running with just pandas. I recognize that the best practice is to write vectorized functions, but there are cases where it is just easier to loop through each element. I don't expect a speedup compared to the non-cudf implementation, but it would be good if there weren't a huge slowdown.

Steps/Code to reproduce bug

Code run in a Jupyter notebook:

%load_ext cudf.pandas
import pandas as pd
import numpy as np
matrix = np.zeros((100, 100))
df = pd.DataFrame(matrix)

%%time
def func(acc, val):
    acc += val
    return acc    
acc = 0.0
for col in df.columns:
    for idx in df.index:
        val = df[col][idx]
        acc = func(acc, val)
print(acc)

Expected behavior

Without cudf this takes 60 ms. With cudf it takes 10 seconds. I would expect performance with cudf to be comparable to performance without cudf.

Environment overview

- Bare-metal
- pip install

Environment details

Not sure where to find that script. Here is my basic setup:

- Platform: x86 + A100 GPU, Ubuntu 22.04.4 LTS
- cuDF: cudf-cu12, version 24.6.1
- CUDA: Cuda compilation tools, release 12.3, V12.3.107
- Python: 3.10.12
- Running in a Jupyter notebook


Aug 03 '24 17:08 magnus-ekman

Hi @magnus-ekman ,

Thank you for the report. This is an issue in cudf with accessing individual scalar values from a column. Scalar access is much slower in cudf than in pandas. Here is an example:

# Pandas
In [1]: import pandas as pd

In [2]: s = pd.Series([10, 1, 2, 3, 4, 5])

In [3]: %timeit s[2]
4.73 μs ± 12.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

# cudf
In [1]: import cudf

In [2]: s = cudf.Series([10, 1, 2, 3, 4, 5])

In [3]: %timeit s[2]
1.66 ms ± 1.71 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Your example amplifies this slowdown because it performs one scalar access per element, 10,000 in total for the 100 x 100 dataframe. We at NVIDIA are actively working on reducing this overhead.
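
As a rough check on that amplification, the per-access timings above can be multiplied out for the reproducer's 100 x 100 dataframe. This is a back-of-the-envelope sketch, not a benchmark: it assumes each `df[col][idx]` costs about one scalar access as timed on the small Series, and it ignores the column selection and everything else in the loop. The helper name `estimated_loop_seconds` is made up for illustration.

```python
# Per-access timings quoted in the %timeit output above, in seconds.
PANDAS_ACCESS_S = 4.73e-6
CUDF_ACCESS_S = 1.66e-3

# The reproducer reads every cell of a 100 x 100 dataframe exactly once.
N_ACCESSES = 100 * 100


def estimated_loop_seconds(per_access_s, n_accesses=N_ACCESSES):
    """Estimate total loop time as (cost per scalar access) x (number of accesses)."""
    return per_access_s * n_accesses


pandas_total = estimated_loop_seconds(PANDAS_ACCESS_S)
cudf_total = estimated_loop_seconds(CUDF_ACCESS_S)

print(f"pandas estimate: {pandas_total * 1e3:.1f} ms")
print(f"cudf estimate:   {cudf_total:.1f} s")
print(f"ratio:           {cudf_total / pandas_total:.0f}x")
```

The estimates come out at tens of milliseconds for pandas and on the order of ten seconds for cudf, the same orders of magnitude as the 60 ms and 10 seconds reported in the issue.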

However, as a temporary workaround, you can disable GPU acceleration for a block of code like this:

from cudf.pandas.module_accelerator import disable_module_accelerator
with disable_module_accelerator():
    # your pandas code
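
If the loop only needs to read values, another possible workaround is to copy the data to host memory once and loop over the resulting NumPy array, so that no per-element access goes through cudf. A minimal sketch, assuming `DataFrame.to_numpy()` returns an ordinary host array under cudf.pandas (not verified here for every version); `sum_by_loop` is a hypothetical helper name, and `func` is the function from the original report:

```python
import numpy as np
import pandas as pd


def func(acc, val):
    # Same per-element function as in the original report.
    acc += val
    return acc


def sum_by_loop(df):
    # Copy the whole dataframe into a NumPy array once, up front.
    values = df.to_numpy()
    acc = 0.0
    # Visit the elements column by column, as the original loop does,
    # but index into the NumPy array instead of the dataframe.
    for col in range(values.shape[1]):
        for row in range(values.shape[0]):
            acc = func(acc, values[row, col])
    return acc


df = pd.DataFrame(np.zeros((100, 100)))
print(sum_by_loop(df))
```

For this particular reduction, `df.to_numpy().sum()` would give the same result without any Python loop. The explicit loop is kept here only to mirror the case where looping is easier to write.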

Aug 13 '24 16:08 galipremsagar

Thanks. I have a (perhaps silly) question on the workaround that is related to this slowdown. When I work in a Jupyter notebook, I like to simply type "df" in a cell and execute the cell to get the DataFrame printed in a nicely formatted way. Doing so is super slow with cudf. If I try to apply your suggested workaround, I don't get a print-out. It works if I instead do "print(df)", but it will not be as nicely formatted. Any ideas of how to solve this?

Aug 13 '24 17:08 magnus-ekman

@magnus-ekman I think that issue with showing df might be the same as #15747.

@galipremsagar Maybe we can work on accelerating the fancy repr in the nearer term, since it should be easier to solve than the broader problem of scalar access.

Aug 13 '24 18:08 bdice
