Issues with `min()` and `argmin()` with `cuda` backend

Open kmohrman opened this issue 7 months ago • 5 comments

Version of Awkward Array

2.8.3

Description and code to reproduce

When trying to use ak.min() and ak.argmin() with the cuda backend, I am seeing incorrect behavior for ak.min(), and a CuPyKernel not found error for ak.argmin().

This code reproduces the issues:

import awkward as ak

myarr = [[1.1,2.1],[1.2,2.2,-3.2]]
myarr_cpu = ak.Array(myarr,backend="cpu")
myarr_gpu = ak.Array(myarr,backend="cuda")

print("myarr:",myarr)
print("min (cpu):", ak.min(myarr_cpu,axis=1))
print("min (gpu):", ak.min(myarr_gpu,axis=1))
print("min (cpu):", ak.argmin(myarr_cpu,axis=1))
print("min (gpu):", ak.argmin(myarr_gpu,axis=1))

If things were working properly, I would expect to see this printed:

myarr: [[1.1, 2.1], [1.2, 2.2, -3.2]]
min (cpu): [1.1, -3.2]
min (gpu): [1.1, -3.2]
min (cpu): [0, 2]
min (gpu): [0, 2]

However, ak.min() on the GPU returns [1.1, inf]. (A note in case it is useful: from some preliminary investigation, this issue may be specific to inputs that include a negative value.)

For ak.argmin() on the GPU, I get AssertionError: CuPyKernel not found: ('awkward_reduce_argmin', <class 'numpy.int64'>, <class 'numpy.float64'>, <class 'numpy.int64'>).

The full output is shown here:

myarr: [[1.1, 2.1], [1.2, 2.2, -3.2]]
min (cpu): [1.1, -3.2]
min (gpu): [1.1, inf]
min (cpu): [0, 2]
Traceback (most recent call last):
  File "/home/k.mohrman/coffea_dir/gpu_studies/columnar_gpu/mre_argmin.py", line 12, in <module>
    print("min (gpu):", ak.argmin(myarr_gpu,axis=1))
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/blue/p.chang/k.mohrman/dir_for_miniconda/miniconda3/envs/coffeagpu_env5/lib/python3.12/site-packages/awkward/_dispatch.py", line 41, in dispatch
    with OperationErrorContext(name, args, kwargs):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/blue/p.chang/k.mohrman/dir_for_miniconda/miniconda3/envs/coffeagpu_env5/lib/python3.12/site-packages/awkward/_errors.py", line 80, in __exit__
    raise self.decorate_exception(exception_type, exception_value)
  File "/blue/p.chang/k.mohrman/dir_for_miniconda/miniconda3/envs/coffeagpu_env5/lib/python3.12/site-packages/awkward/_dispatch.py", line 67, in dispatch
    next(gen_or_result)
  File "/blue/p.chang/k.mohrman/dir_for_miniconda/miniconda3/envs/coffeagpu_env5/lib/python3.12/site-packages/awkward/operations/ak_argmin.py", line 79, in argmin
    return _impl(array, axis, keepdims, mask_identity, highlevel, behavior, attrs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/blue/p.chang/k.mohrman/dir_for_miniconda/miniconda3/envs/coffeagpu_env5/lib/python3.12/site-packages/awkward/operations/ak_argmin.py", line 160, in _impl
    out = ak._do.reduce(
          ^^^^^^^^^^^^^^
  File "/blue/p.chang/k.mohrman/dir_for_miniconda/miniconda3/envs/coffeagpu_env5/lib/python3.12/site-packages/awkward/_do.py", line 294, in reduce
    next = layout._reduce_next(
           ^^^^^^^^^^^^^^^^^^^^
  File "/blue/p.chang/k.mohrman/dir_for_miniconda/miniconda3/envs/coffeagpu_env5/lib/python3.12/site-packages/awkward/contents/listoffsetarray.py", line 1618, in _reduce_next
    outcontent = trimmed._reduce_next(
                 ^^^^^^^^^^^^^^^^^^^^^
  File "/blue/p.chang/k.mohrman/dir_for_miniconda/miniconda3/envs/coffeagpu_env5/lib/python3.12/site-packages/awkward/contents/numpyarray.py", line 1137, in _reduce_next
    out = reducer.apply(self, parents, starts, shifts, outlength)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/blue/p.chang/k.mohrman/dir_for_miniconda/miniconda3/envs/coffeagpu_env5/lib/python3.12/site-packages/awkward/_reducers.py", line 172, in apply
    array.backend[
  File "/blue/p.chang/k.mohrman/dir_for_miniconda/miniconda3/envs/coffeagpu_env5/lib/python3.12/site-packages/awkward/_backends/cupy.py", line 39, in __getitem__
    raise AssertionError(f"CuPyKernel not found: {index!r}")
AssertionError: CuPyKernel not found: ('awkward_reduce_argmin', <class 'numpy.int64'>, <class 'numpy.float64'>, <class 'numpy.int64'>)


See if this has been reported at https://github.com/scikit-hep/awkward/issues

kmohrman avatar Jun 04 '25 17:06 kmohrman

Hi @kmohrman. Indeed, the argmin and argmax reducers (and a couple more) have not been implemented for the cuda backend yet. The wrong value you're getting for ak.min, however, is a bug. I would assume it is something in the kernel implementation.

ikrommyd avatar Jun 04 '25 18:06 ikrommyd

I see you're reporting a lot of cuda backend bugs lately which is a very good thing. Thanks a lot! I would like to point out that it's good to test these things, but do not trust the cuda backend for an actual analysis of course. It's still under development.

ikrommyd avatar Jun 04 '25 18:06 ikrommyd

https://github.com/scikit-hep/awkward/blob/939b46c26ef73fb5900eb9d94537e68ff666e97b/src/awkward/_connect/cuda/cuda_kernels/awkward_reduce_min.cu#L72

Is this the vanilla atomicMin from CUDA? It looks like it only handles integers? https://stackoverflow.com/questions/17399119/how-do-i-use-atomicmax-on-floating-point-values-in-cuda

Edit: that's indeed the issue
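
For background on the integer-versus-float concern raised above, here is a minimal CPU-side sketch in plain Python. It only illustrates the general fact that comparing IEEE-754 bit patterns as signed integers agrees with float ordering for non-negative values but reverses it among negative values. It is not a diagnosis of what the Awkward kernel does, and it does not reproduce the inf result from this issue. The helper names are hypothetical.

```python
import struct


def float_bits_as_int64(x: float) -> int:
    """Reinterpret the IEEE-754 bits of a float64 as a signed 64-bit integer."""
    return struct.unpack("<q", struct.pack("<d", x))[0]


def int_min_over_float_bits(values):
    """Pick the 'minimum' by comparing bit patterns as signed integers.

    This mimics what would happen if an integer-only min were applied
    directly to float storage (an assumption made for illustration only).
    """
    return min(values, key=float_bits_as_int64)


non_negative = [1.1, 2.1, 0.5]
negative = [-1.2, -2.2, -3.2]

# For non-negative floats, integer ordering of the bits matches float ordering.
print(int_min_over_float_bits(non_negative), min(non_negative))

# Among negative floats, the integer ordering of the bits is reversed.
print(int_min_over_float_bits(negative), min(negative))
```

This is why float atomics built on integer atomics, like the ones discussed in the linked Stack Overflow question, need to treat the sign separately or fall back to a compare-and-swap loop.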

@@ -69,7 +75,7 @@ awkward_reduce_min_b(

       int64_t parent = parents[thread_id];
       if (idx == blockDim.x - 1 || thread_id == lenparents - 1 || parents[thread_id] != parents[thread_id + 1]) {
-        atomicMin(&toptr[parent], temp[thread_id]);
+        toptr[parent] = temp[thread_id];
       }
     }

fixes the repro in this issue. This is okay because grid_size is already set to one anyway for your input. Ofc in general we can't do this.
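
To make the "in general we can't do this" part concrete, here is a small sequential model in plain Python of a segmented min in which each block reduces its own slice and then writes one partial result per parent. It is a simplified sketch, not the Awkward kernel: the function names, the block sizes, and the input (reordered from the repro so that the failure shows up deterministically) are all assumptions made for illustration. On a real GPU the blocks run concurrently, so the plain store would be a race rather than a fixed "last writer wins".

```python
import math


def reduce_min_blocks(values, parents, outlength, block_size, combine):
    """Model a segmented min computed block by block.

    Each block computes a partial minimum per parent over its own slice,
    then merges it into the output with `combine(old, new)`.
    """
    out = [math.inf] * outlength  # identity element for min
    for start in range(0, len(values), block_size):
        stop = min(start + block_size, len(values))
        partial = {}
        for i in range(start, stop):
            p = parents[i]
            partial[p] = min(partial.get(p, math.inf), values[i])
        for p, v in partial.items():
            out[p] = combine(out[p], v)
    return out


def plain_store(old, new):
    """Overwrite the output, like `toptr[parent] = temp[thread_id]`."""
    return new


# Hypothetical input: [[1.1, 2.1], [-3.2, 1.2, 2.2]] flattened with parents.
values = [1.1, 2.1, -3.2, 1.2, 2.2]
parents = [0, 0, 1, 1, 1]

# One block covers everything: a plain store is enough.
print(reduce_min_blocks(values, parents, 2, block_size=8, combine=plain_store))

# The second list spans two blocks: the later partial overwrites the true min.
print(reduce_min_blocks(values, parents, 2, block_size=2, combine=plain_store))

# Merging partials with min (what a correct atomic min would do) is safe.
print(reduce_min_blocks(values, parents, 2, block_size=2, combine=min))
```

With a single block the plain store gives [1.1, -3.2]; once the second list is split across blocks it gives [1.1, 2.2], while merging with min still gives [1.1, -3.2].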

Moelf avatar Jun 14 '25 00:06 Moelf

Thanks, @Moelf ! Could you please make a PR? Thanks!

ianna avatar Jun 14 '25 04:06 ianna

The problem is that this code path can potentially launch more than one thread, so we need a general solution anyway, because we use this in more than one place.

But I also see: https://github.com/scikit-hep/awkward/blob/939b46c26ef73fb5900eb9d94537e68ff666e97b/src/awkward/_connect/cuda/cuda_kernels/cuda_common.cu#L173-L179

I guess there's a bug here?

Moelf avatar Jun 14 '25 05:06 Moelf