center_of_mass much slower than expected?
Description
Before I forget: thanks for CuPy!! :-) It's been great for me so far, with some very nice speedups.
For cupyx.scipy.ndimage.center_of_mass, though, it seems like something is wrong or not yet fully optimized, or perhaps I am doing something wrong? :)
The sample program below shows a speedup of only about 1.3x compared to SciPy for a 1000x1000 input matrix, and it's not much faster for larger matrices.
Thanks in advance for your reaction!
To Reproduce
import time
import numpy as np
import scipy.ndimage  # import the submodule explicitly so scipy.ndimage is available
import cupy as cp
from cupyx.scipy.ndimage import center_of_mass
from cupy.cuda.runtime import deviceSynchronize

# Do not count the first round, as it initializes the GPU
# (as I read it, this happens only one time within a process).
for i in range(2):
    size = 4000
    image_data = np.ones((size, size), dtype='float64')
    inputs = (image_data > 0).astype('float64')

    # CPU
    t0 = time.time()
    b = scipy.ndimage.center_of_mass(
        inputs,
        labels=image_data,
        index=range(1, 2),
    )
    if i == 1:
        print('calc cpu %.5f' % (time.time() - t0))
        print()
    ta = time.time() - t0

    # GPU
    t0 = t1 = time.time()
    a_gpu = cp.asarray(inputs)
    b_gpu = cp.asarray(image_data)
    c_gpu = cp.asarray(range(1, 2))
    if i == 1:
        print('copy to gpu %.5f' % (time.time() - t0))
    t0 = time.time()
    result_gpu = center_of_mass(a_gpu, b_gpu, index=c_gpu)
    deviceSynchronize()  # wait for the kernel so the timing is meaningful
    if i == 1:
        print('calc on gpu %.5f' % (time.time() - t0))
    t0 = time.time()
    print('result', result_gpu)
    if i == 1:
        print('copy result to cpu %.5f' % (time.time() - t0))
    tb = time.time() - t1
    if i == 1:
        print('total gpu %.5f' % (time.time() - t1))

    # speedup
    if i == 1:
        print()
        print('speedup %.2f' % (ta / tb))
Installation
Conda-Forge (conda install ...)
Environment
No response
Additional Information
input (1000, 1000) float64
input (1000, 1000) float64
input (1000, 1000) float64
result [array([499.5, 499.5])]
calc cpu 0.09642

copy to gpu 0.00350
input (1000, 1000) float64
input (1000, 1000) float64
input (1000, 1000) float64
calc on gpu 0.06686
result [array([499.5, 499.5])]
copy result to cpu 0.00020
total gpu 0.07059

speedup 1.37
How about using float32 instead of float64? Double precision is much slower than single precision on GPUs.
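A minimal CPU-side sketch of that cast (hypothetical, not from the original script; the CuPy calls stay the same, you just transfer float32 arrays instead):

```python
import numpy as np
from scipy import ndimage

# Same benchmark inputs as the repro script, but cast to float32.
# float32 halves the host-to-device copy size, and on most GPUs
# float32 kernels run several times faster than float64 ones.
image_data = np.ones((1000, 1000), dtype=np.float32)
inputs = (image_data > 0).astype(np.float32)

# The result itself is unchanged for a uniform image.
com = ndimage.center_of_mass(inputs, labels=image_data, index=range(1, 2))
print(com)  # [(499.5, 499.5)]
```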
I just tried this code on an NVIDIA A100:
result [array([8.2595521e+304, 0.0000000e+000])]
calc cpu 2.38095
copy to gpu 0.05089
calc on gpu 0.01243
result [array([inf, nan])]
copy result to cpu 0.00049
total gpu 0.06385
speedup 37.29
I get a speedup of 37x.
If I use fp32:
result [array([2.70107404e+39, 0.00000000e+00])]
calc cpu 2.26985
copy to gpu 0.02580
calc on gpu 0.00718
result [array([-16.06237794, -0. ])]
copy result to cpu 0.00044
total gpu 0.03346
speedup 67.84
Then I get almost 68x. Which GPU are you using?
Thanks! I realized in the meantime that my conda env was a few months old; after updating it I get a good speedup of about 12x.
float32 indeed helps, about a factor of 1.5 faster, and float16 is even faster. Thanks for this important tip! (I'm relatively new to GPU programming.)
I have a GP107GL [Quadro P620].
thanks again!
btw @emcastillo, your centers of mass look weird? Since we are passing an array filled with ones, the center of mass should be in the middle, right?
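For reference, a quick CPU sanity check (plain SciPy, not CuPy): for a uniform image every pixel has equal weight, so the center of mass is the mean coordinate along each axis, i.e. (size - 1) / 2:

```python
import numpy as np
from scipy import ndimage

# An all-ones 1000x1000 image: the center of mass should be the
# mean of the coordinates 0..999 along each axis, i.e. 499.5.
size = 1000
image = np.ones((size, size))
com = ndimage.center_of_mass(image)
print(com)  # (499.5, 499.5)
```

Anything far from (499.5, 499.5), like the inf/nan values in the A100 run above, points at a bug rather than a precision issue.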
Glad to hear that! There may be some issues in how the kernel handles atomics on devices with a high core count. We should definitely look into that.