
[QST] cuml DBSCAN is limited by cpu?

Open jeb2112 opened this issue 1 year ago • 1 comments

CPU bottleneck on DBSCAN? I have used a few routines from RAPIDS cuSpatial and seen up to a 20-fold speedup on the GPU compared to the CPU. With cuML DBSCAN, however, I get the same execution time on the GPU, or sometimes a slightly longer one. I created a simple test script and monitored the processors. The GPU runs at 100% and there is spare GPU RAM on my machine, but during the GPU run the CPU is also maxed at 100%. Some residual CPU load while executing on the GPU makes sense, but if the CPU is maxed and the overall computing time is unchanged, then the CPU seems to be the actual bottleneck.

I have read the documentation describing CPU/GPU interoperability in RAPIDS cuML 23.10, but I don't see anything there that explains this behaviour.

import os
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_circles
import time
from sklearn.cluster import DBSCAN
import pandas as pd
import cudf
from cuml import DBSCAN as cumlDBSCAN

# Two concentric noisy circles, scaled by 3 along both axes
X, y = make_circles(n_samples=int(1e5), factor=.35, noise=.05)
X[:, 0] = 3*X[:, 0]
X[:, 1] = 3*X[:, 1]

# CPU run: scikit-learn DBSCAN on the NumPy array
db = DBSCAN(eps=0.6, min_samples=2)
start = time.time()
db.fit_predict(X)
print('cpu {}'.format(time.time()-start))

# GPU run: copy the data into a cuDF DataFrame, then time cuML DBSCAN
X_df = pd.DataFrame({'fea%d'%i: X[:, i] for i in range(X.shape[1])})
X_gpu = cudf.DataFrame.from_pandas(X_df)
db_gpu = cumlDBSCAN(eps=0.6, min_samples=2)
start = time.time()
db_gpu.fit_predict(X_gpu)
print('gpu {}'.format(time.time()-start))

jeb2112 avatar Oct 29 '23 14:10 jeb2112

Thanks for opening the issue @jeb2112. I investigated it by creating a script, timing it, and averaging the results. Here are the results:

Data preprocessing alone:
 real = 4.894
 user = 5.316
 sys = 2.514

Data preprocessing + DBSCAN work:
 real = 7.614
 user = 7.950
 sys = 2.6

DBSCAN work (computed as the difference between the two runs above):
 real = 2.72
 user = 2.634
 sys = 0.086
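The subtraction can be reproduced as a quick check. This is a minimal sketch that uses only the figures quoted above; the dictionary names are illustrative, and the values are assumed to be in seconds since the unit is not stated.

```python
# Averaged timings quoted above (assumed to be seconds).
preprocessing = {"real": 4.894, "user": 5.316, "sys": 2.514}
preprocessing_plus_dbscan = {"real": 7.614, "user": 7.950, "sys": 2.6}

# DBSCAN's share is what remains after removing the preprocessing-only run.
dbscan = {
    key: round(preprocessing_plus_dbscan[key] - preprocessing[key], 3)
    for key in preprocessing
}

for key, value in dbscan.items():
    print(f"{key} = {value}")
```

Note that user + sys for the DBSCAN part adds up to almost exactly the real time, which is what makes the run look CPU-bound at first glance.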

At this stage, it looks like the program is CPU-bound. However, DBSCAN, like many other algorithms, sometimes needs a blocking synchronization to obtain a temporary result before it can proceed with further work. One example is a temporary result required during the processing of batches in DBSCAN.

These CUDA synchronizations can make the program look CPU-bound. In reality, a loop in the driver code is polling the queue while it waits for the result. In a multi-threaded program, it could be worthwhile to save resources by relying more on the OS kernel to handle events, which reduces the CPU burden. This can be done by setting a CUDA flag such as cudaDeviceScheduleBlockingSync. For cuML this is not necessarily important.
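The difference between polling and a blocking wait can be illustrated without a GPU. The sketch below is only an analogy in plain Python, not CUDA driver code: a timer thread stands in for the device work, and `measure`, `spin_wait` and `blocking_wait` are hypothetical helper names. It compares the wall-clock time with the CPU time the process consumes while waiting.

```python
import threading
import time


def measure(wait_fn, work_seconds=0.2):
    """Return (wall, cpu) seconds spent waiting for a simulated device result."""
    done = threading.Event()
    # Stand-in for GPU work: a timer thread signals completion after work_seconds.
    timer = threading.Timer(work_seconds, done.set)
    wall_start, cpu_start = time.perf_counter(), time.process_time()
    timer.start()
    wait_fn(done)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    timer.join()
    return wall, cpu


def spin_wait(done):
    # Busy polling: the host thread keeps checking and burns CPU the whole time.
    while not done.is_set():
        pass


def blocking_wait(done):
    # Blocking wait: the host thread sleeps until it is signalled.
    done.wait()


spin_wall, spin_cpu = measure(spin_wait)
block_wall, block_cpu = measure(blocking_wait)
print(f"spin:     wall={spin_wall:.3f}s cpu={spin_cpu:.3f}s")
print(f"blocking: wall={block_wall:.3f}s cpu={block_cpu:.3f}s")
```

Both variants should take about the same wall-clock time, since the simulated work is identical. Only the polling variant should accumulate CPU time comparable to the wall time, which matches the pattern in the measurements in this thread.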

DBSCAN work with the CUDA flag change (computed as the difference):
 real = 2.731
 user = 0.802
 sys = 0.495

As you can see, using the flag reduces the burden on the CPU while the wall-clock time stays essentially the same. All in all, DBSCAN does not appear to be CPU-bound.
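One way to put the two measurements side by side is to express the CPU time as a fraction of the wall-clock time. This is a rough sketch built only from the numbers reported in this thread; `cpu_per_wall` is a hypothetical helper, and the values are assumed to be in seconds.

```python
def cpu_per_wall(real, user, system):
    """CPU time consumed (user + sys) per unit of wall-clock time."""
    return (user + system) / real


# DBSCAN-only timings from this thread (assumed to be seconds).
default_ratio = cpu_per_wall(real=2.72, user=2.634, system=0.086)
blocking_ratio = cpu_per_wall(real=2.731, user=0.802, system=0.495)

print(f"default sync:  {default_ratio:.2f}")
print(f"blocking sync: {blocking_ratio:.2f}")
```

With the default behaviour the process keeps roughly one full core busy for the whole run. With the blocking-sync flag it uses a little under half a core for nearly the same elapsed time, which is consistent with the CPU mostly waiting on the GPU.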

The performance differences between cuSpatial and cuML could be interesting to investigate though.

viclafargue avatar Oct 30 '23 18:10 viclafargue