cudf
[PERF] looping through dataframe is 100x slower than when running without cudf
Describe the bug I have a case where I loop over every element in a dataframe and call a function on each one. When running with cudf.pandas, this takes on the order of 100x longer than when running with plain pandas. I recognize that the best practice is to write vectorized functions, but there are cases where it is simply easier to loop over each element. I don't expect a speedup compared to the non-cudf implementation, but it would be good if there weren't a huge slowdown.
Steps/Code to reproduce bug Code run in a Jupyter notebook:
%load_ext cudf.pandas
import pandas as pd
import numpy as np
matrix = np.zeros((100, 100))
df = pd.DataFrame(matrix)
%%time
def func(acc, val):
    acc += val
    return acc

acc = 0.0
for col in df.columns:
    for idx in df.index:
        val = df[col][idx]
        acc = func(acc, val)
print(acc)
Expected behavior When running without cudf this takes 60ms. When running with cudf it takes 10 seconds. I would expect performance with cudf to be comparable to performance without cudf.
Environment overview Bare metal, installed with pip
Environment details Not sure where to find that script. Here is my basic setup:
Platform: x86 + A100 GPU, Ubuntu 22.04.4 LTS
cuDF: Name: cudf-cu12, Version: 24.6.1
CUDA: Cuda compilation tools, release 12.3, V12.3.107
Python: Python 3.10.12
Running in a Jupyter notebook
Hi @magnus-ekman,
Thank you for the report. This comes from how cudf accesses individual scalar values in a column: scalar access is inherently slower in cudf than in pandas. Here is an example:
# Pandas
In [1]: import pandas as pd
In [2]: s = pd.Series([10, 1, 2, 3, 4, 5])
In [3]: %timeit s[2]
4.73 μs ± 12.3 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
# cudf
In [1]: import cudf
In [2]: s = cudf.Series([10, 1, 2, 3, 4, 5])
In [3]: %timeit s[2]
1.66 ms ± 1.71 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Your example amplifies this slowdown because it performs one scalar access per element, 10,000 in total for the 100 x 100 dataframe. This is something we at NVIDIA are actively working to alleviate.
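To make the amplification concrete, here is a rough back-of-the-envelope sketch. It assumes the per-access timings from the `%timeit` runs above carry over to the reporter's loop, which is only approximately true (different data, different machine, and `df[col][idx]` also pays for the column lookup). The variable names are illustrative.

```python
# Rough estimate only: multiplies the per-access timings quoted above
# by the number of scalar reads the reproducer performs.
n_rows, n_cols = 100, 100
n_accesses = n_rows * n_cols  # one scalar read per element

pandas_per_access_s = 4.73e-6  # 4.73 microseconds, from the pandas %timeit
cudf_per_access_s = 1.66e-3    # 1.66 milliseconds, from the cudf %timeit

pandas_total_s = n_accesses * pandas_per_access_s
cudf_total_s = n_accesses * cudf_per_access_s
slowdown = cudf_per_access_s / pandas_per_access_s

print(f"scalar reads: {n_accesses}")
print(f"pandas estimate: {pandas_total_s * 1e3:.0f} ms")
print(f"cudf estimate:   {cudf_total_s:.1f} s")
print(f"per-access ratio: {slowdown:.0f}x")
```

The estimates come out at tens of milliseconds for pandas and on the order of ten seconds for cudf, which is the same order of magnitude as the 60 ms and 10 s reported.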
However, as a temporary workaround, you can disable GPU acceleration for a block of code like this:
from cudf.pandas.module_accelerator import disable_module_accelerator
with disable_module_accelerator():
    # your pandas code goes here
    ...
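A different option, not suggested in this thread and offered only as a sketch, is to avoid per-element dataframe indexing altogether: convert the dataframe to a NumPy array once, then loop over the array. This assumes the data fits in host memory and that `to_numpy()` returns a host-side array. It has not been timed under cudf.pandas here. The example uses plain pandas and a matrix of ones (rather than zeros) so the result is easy to check.

```python
import numpy as np
import pandas as pd


def func(acc, val):
    # Same accumulator function as in the reproducer.
    acc += val
    return acc


# Illustrative data: ones instead of zeros so the sum is non-trivial.
df = pd.DataFrame(np.ones((100, 100)))

# One bulk conversion instead of 10,000 scalar reads from the dataframe.
values = df.to_numpy()

acc = 0.0
for col in range(values.shape[1]):
    for row in range(values.shape[0]):
        acc = func(acc, values[row, col])

print(acc)
```

The loop body is unchanged. Only the source of the scalar values moves from the dataframe to the array, so each element read is a plain NumPy index rather than a dataframe lookup.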
Thanks. I have a (perhaps silly) question on the workaround that is related to this slowdown. When I work in a Jupyter notebook, I like to simply type "df" in a cell and execute the cell to get the DataFrame printed in a nicely formatted way. Doing so is super slow with cudf. If I try to apply your suggested workaround, I don't get a print-out. It works if I instead do "print(df)", but it will not be as nicely formatted. Any ideas of how to solve this?
@magnus-ekman I think that issue with showing df might be the same as #15747.
@galipremsagar Maybe we can work on accelerating the fancy repr in the nearer term, since it should be easier to solve than the broader problem of scalar access.