[QST] Running cudf terribly slow
What is your question? I have Python code that calculates lots of numbers for various custom dataclass objects. In the past I switched to polars in order to speed things up. Now I need to go even faster, so I am trying to implement a solution on a GPU. The code runs without any error in PyCharm, but when I try to run it in the terminal I get an error. Any help please?
python3 -m cudf.pandas main.py
Batch_id: 17
Process ForkProcess-1:
Traceback (most recent call last):
  File "/home/hakan/miniconda3/envs/bbx_gpu_env/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/hakan/miniconda3/envs/bbx_gpu_env/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/hakan/miniconda3/envs/bbx_gpu_env/lib/python3.10/concurrent/futures/process.py", line 240, in _process_worker
    call_item = call_queue.get(block=True)
  File "/home/hakan/miniconda3/envs/bbx_gpu_env/lib/python3.10/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
  File "/home/hakan/miniconda3/envs/bbx_gpu_env/lib/python3.10/site-packages/cudf/pandas/fast_slow_proxy.py", line 602, in __setstate__
    unpickled_wrapped_obj = pickle.loads(state)
  File "/home/hakan/miniconda3/envs/bbx_gpu_env/lib/python3.10/site-packages/cudf/core/abc.py", line 178, in host_deserialize
    frames = [
  File "/home/hakan/miniconda3/envs/bbx_gpu_env/lib/python3.10/site-packages/cudf/core/abc.py", line 179, in <listcomp>
    cudf.core.buffer.as_buffer(f) if c else f
  File "/home/hakan/miniconda3/envs/bbx_gpu_env/lib/python3.10/site-packages/cudf/core/buffer/utils.py", line 136, in as_buffer
    return buffer_class(owner=owner_class.from_host_memory(data))
  File "/home/hakan/miniconda3/envs/bbx_gpu_env/lib/python3.10/site-packages/cudf/core/buffer/buffer.py", line 216, in from_host_memory
    buf = rmm.DeviceBuffer(ptr=ptr, size=size)
  File "device_buffer.pyx", line 88, in rmm._lib.device_buffer.DeviceBuffer.__cinit__
  File "memory_resource.pyx", line 1087, in rmm._lib.memory_resource.get_current_device_resource
  File "/home/hakan/miniconda3/envs/bbx_gpu_env/lib/python3.10/site-packages/rmm/_cuda/gpu.py", line 58, in getDevice
    raise CUDARuntimeError(status)
rmm._cuda.gpu.CUDARuntimeError: cudaErrorInitializationError: initialization error
Hey @Hakan439, thanks for raising this issue! Could you tell me the output of running nvidia-smi in your terminal?
Hi,
here it is:
@Hakan439 Since you say the code runs in PyCharm, can you confirm whether the terminal where you are getting the error and PyCharm are using the same environment?
You could run which python and share the output from both the terminal and PyCharm.
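For example, running a small snippet like this in both the terminal and the PyCharm run configuration would show whether they point at the same interpreter and whether cudf is importable there (just a minimal sketch; the import check is optional):

# Run this in both the terminal and PyCharm to compare environments.
import sys
print("interpreter:", sys.executable)

# Optional: confirm cudf is importable from this environment.
try:
    import cudf
    print("cudf version:", cudf.__version__)
except ImportError as exc:
    print("cudf not importable:", exc)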
which python:
cuda, cupy, cudf:
GPU enabled:
Sample code:
Results:
Somehow the GPU now seems to be enabled, but it runs very slowly. In the sample code above, I reduced the number of rows in the dataframe to 1000 for the test. With the CPU dataframe it took 0.12 seconds, but on the GPU it took 8 seconds. Mine is running as an eGPU, by the way; I do not know whether that makes a difference or not. What am I missing?
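For reference, a minimal sketch of the kind of CPU-vs-GPU comparison described above; this is not the original sample code, and the column names and the groupby operation are made up for illustration:

import time

import numpy as np
import pandas as pd
import cudf

n = 1_000  # small frames like this rarely benefit from a GPU
data = {"key": np.random.randint(0, 10, n), "val": np.random.rand(n)}

# Time the same groupby-mean on the CPU (pandas) and on the GPU (cudf).
t0 = time.perf_counter()
pd.DataFrame(data).groupby("key")["val"].mean()
print("pandas:", time.perf_counter() - t0, "s")

t0 = time.perf_counter()
cudf.DataFrame(data).groupby("key")["val"].mean()
print("cudf:  ", time.perf_counter() - t0, "s")

Note that the first cudf operation in a process also pays one-time CUDA initialization costs (context creation, memory pool setup), so a single timing of a tiny frame can look far worse than steady-state GPU performance.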
In the pycharm console output I see a different environment from everything else, gputest vs bbx_gpu_env. Not sure if that is significant or intentional. Good to know that it is running now, though. What do you mean by "eGPU"? Regarding performance, what does your data look like? How many columns does it have? Do you observe similar issues if you have a single column?
I tried with several virtual environments; the initial env was gputest. In my single-column benchmark it also ran slowly. When I try to run my original Python code via cudf, it gives the error in the first message.
I have Python code that calculates lots of numbers for various custom dataclass objects
If cudf is working now but it is still slow, it is possible that your code is using custom dataclasses in a way that cudf simply doesn't support and so you end up falling back to running everything on the CPU. The relative slowdown you mentioned (0.12 vs 8 seconds) is pretty huge though. Have you tried running your code through the cudf.pandas profiler?
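For reference, recent cudf releases ship a profiler for cudf.pandas that reports which operations ran on the GPU and which fell back to CPU pandas. A typical invocation looks like the sketch below; the exact flag names may differ between versions:

python3 -m cudf.pandas --profile main.py

In Jupyter/IPython the same information is available via the %%cudf.pandas.profile cell magic (after %load_ext cudf.pandas). The per-operation report should make it clear whether the custom dataclass handling is forcing everything back onto the CPU.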