center_of_mass much slower than expected?
Description
Before I forget: thanks for CuPy!! :-) It's been great for me so far, with some very nice speedups.
For cupyx.scipy.ndimage.center_of_mass, though, it seems like something is wrong or not yet fully optimized, or perhaps I am doing something wrong? :)
The sample program below shows a speedup of only about 1.3x compared to SciPy for a 1000x1000 input matrix, and it's not much faster for larger matrices.
Thanks in advance for your reaction!
To Reproduce
import time
import numpy as np
import scipy.ndimage  # import the submodule explicitly so scipy.ndimage is available
import cupy as cp
from cupyx.scipy.ndimage import center_of_mass
from cupy.cuda.runtime import deviceSynchronize

# Do not count the first round, as it initializes the GPU
# (as I read it, this happens only one time within a process).
for i in range(2):
    size = 4000
    image_data = np.ones((size, size), dtype='float64')
    inputs = (image_data > 0).astype('float64')

    # CPU
    t0 = time.time()
    b = scipy.ndimage.center_of_mass(
        inputs,
        labels=image_data,
        index=range(1, 2),
    )
    if i == 1:
        print('calc cpu %.5f' % (time.time() - t0))
        print()
    ta = time.time() - t0

    # GPU
    t0 = t1 = time.time()
    a_gpu = cp.asarray(inputs)
    b_gpu = cp.asarray(image_data)
    c_gpu = cp.asarray(range(1, 2))
    if i == 1:
        print('copy to gpu %.5f' % (time.time() - t0))
    t0 = time.time()
    result_gpu = center_of_mass(a_gpu, b_gpu, index=c_gpu)
    deviceSynchronize()  # wait for the kernel so the timing is meaningful
    if i == 1:
        print('calc on gpu %.5f' % (time.time() - t0))
    t0 = time.time()
    print('result', result_gpu)
    if i == 1:
        print('copy result to cpu %.5f' % (time.time() - t0))
    tb = time.time() - t1
    if i == 1:
        print('total gpu %.5f' % (time.time() - t1))

    # speedup
    if i == 1:
        print()
        print('speedup %.2f' % (ta / tb))
Installation
Conda-Forge (conda install ...)
Environment
No response
Additional Information
input (1000, 1000) float64
input (1000, 1000) float64
input (1000, 1000) float64
result [array([499.5, 499.5])]
calc cpu 0.09642

copy to gpu 0.00350
input (1000, 1000) float64
input (1000, 1000) float64
input (1000, 1000) float64
calc on gpu 0.06686
result [array([499.5, 499.5])]
copy result to cpu 0.00020
total gpu 0.07059

speedup 1.37
How about using float32 instead of float64? Double precision is much slower than single precision on GPUs.
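A minimal CPU-side sketch of that cast (hypothetical, not from the original script; the CuPy calls stay the same, you just transfer float32 arrays instead):

```python
import numpy as np
from scipy import ndimage

# Same benchmark inputs as the repro script, but cast to float32.
# float32 halves the host-to-device copy size, and on most GPUs
# float32 kernels run several times faster than float64 ones.
image_data = np.ones((1000, 1000), dtype=np.float32)
inputs = (image_data > 0).astype(np.float32)

# The result itself is unchanged for a uniform image.
com = ndimage.center_of_mass(inputs, labels=image_data, index=range(1, 2))
print(com)  # [(499.5, 499.5)]
```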
I just tried this code on an NVIDIA A100:
result [array([8.2595521e+304, 0.0000000e+000])]
calc cpu 2.38095
copy to gpu 0.05089
calc on gpu 0.01243
result [array([inf, nan])]
copy result to cpu 0.00049
total gpu 0.06385
speedup 37.29
I get a speedup of 37x.
If I use fp32:
result [array([2.70107404e+39, 0.00000000e+00])]
calc cpu 2.26985
copy to gpu 0.02580
calc on gpu 0.00718
result [array([-16.06237794, -0. ])]
copy result to cpu 0.00044
total gpu 0.03346
speedup 67.84
Then I get almost 68x. Which GPU are you using?
Thanks! I realized in the meantime that my conda env was a few months old; after updating it I get a good speedup of about 12x.
float32 indeed helps, about a factor of 1.5 faster, and float16 is even faster. Thanks for this important tip! (I'm relatively new to GPU programming.)
I have a GP107GL [Quadro P620].
thanks again!
btw @emcastillo, your centers of mass look weird? Since we are passing an array filled with ones, the center of mass should be in the middle, right?
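For reference, a quick CPU sanity check (plain SciPy, not CuPy): for a uniform image every pixel has equal weight, so the center of mass is the mean coordinate along each axis, i.e. (size - 1) / 2:

```python
import numpy as np
from scipy import ndimage

# An all-ones 1000x1000 image: the center of mass should be the
# mean of the coordinates 0..999 along each axis, i.e. 499.5.
size = 1000
image = np.ones((size, size))
com = ndimage.center_of_mass(image)
print(com)  # (499.5, 499.5)
```

Anything far from (499.5, 499.5), like the inf/nan values in the A100 run above, points at a bug rather than a precision issue.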
Glad to hear that! There may be some issues in how the kernel handles atomics on devices with a high core count. We should definitely look into that.