[QST] Is cuML DBSCAN limited by the CPU?
Is there a CPU bottleneck on DBSCAN? I have used a few routines from RAPIDS cuSpatial and seen up to a 20-fold speed increase on the GPU compared to the CPU. However, with cuML DBSCAN I get the same execution time on the GPU, or sometimes slightly longer. I created a simple test script and monitored the processors. The GPU is running at 100% and there is spare GPU RAM on my machine, but during the GPU run the CPU is also maxed at 100%. It makes sense that there will be some residual CPU load while executing on the GPU, but if the CPU is maxed and the overall computing time is unchanged, then it seems like the CPU is actually the bottleneck.
I have read the documentation describing CPU/GPU interoperability in RAPIDS cuML 23.10, but I don't see anything there that explains this behaviour.
import time

import pandas as pd
from sklearn.datasets import make_circles
from sklearn.cluster import DBSCAN

import cudf
from cuml import DBSCAN as cumlDBSCAN

# Generate two concentric circles and scale them up by a factor of 3.
X, y = make_circles(n_samples=int(1e5), factor=.35, noise=.05)
X[:, 0] = 3 * X[:, 0]
X[:, 1] = 3 * X[:, 1]

# CPU run with scikit-learn DBSCAN.
db = DBSCAN(eps=0.6, min_samples=2)
start = time.time()
db.fit_predict(X)
print('cpu {}'.format(time.time() - start))

# GPU run with cuML DBSCAN on a cuDF DataFrame.
X_df = pd.DataFrame({'fea%d' % i: X[:, i] for i in range(X.shape[1])})
X_gpu = cudf.DataFrame.from_pandas(X_df)
db_gpu = cumlDBSCAN(eps=0.6, min_samples=2)
start = time.time()
db_gpu.fit_predict(X_gpu)
print('gpu {}'.format(time.time() - start))
Thanks for opening the issue @jeb2112. I investigated this by creating a script, timing it over several runs, and averaging the results. Here are the results (real/user/sys times, in seconds):
Data preprocessing alone:
real = 4.894
user = 5.316
sys = 2.514
Data preprocessing + DBSCAN work:
real = 7.614
user = 7.950
sys = 2.6
DBSCAN work (computed by difference):
real = 2.72
user = 2.634
sys = 0.086
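For context, here is a minimal sketch of the kind of measurement described above: the script is run twice under the shell's time command, once with only the preprocessing and once with preprocessing plus the DBSCAN fit, and the DBSCAN cost is taken as the difference. This is a hypothetical illustration (including the RUN_DBSCAN toggle and the stand-in random data), not the exact script used.

import os

import numpy as np
import cudf
from cuml import DBSCAN as cumlDBSCAN

# Stand-in data; the original reproducer used sklearn's make_circles.
X = np.random.rand(int(1e5), 2)

# Preprocessing: build the cuDF DataFrame handed to cuML.
X_gpu = cudf.DataFrame({'fea%d' % i: X[:, i] for i in range(X.shape[1])})

# Run the whole script under `time python script.py`; set RUN_DBSCAN=0 to
# measure preprocessing alone, then take the difference of the two runs.
if os.environ.get('RUN_DBSCAN', '1') == '1':
    cumlDBSCAN(eps=0.6, min_samples=2).fit_predict(X_gpu)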
At this stage, it looks like the program is CPU-bound. However, DBSCAN, like many other algorithms, at times requires blocking synchronization to get a temporary result before proceeding with further work. One example is the temporary result needed during the processing of batches in DBSCAN.
These CUDA synchronizations can make the program appear as if it were CPU-bound: in reality, the driver code spins in a loop polling the queue while waiting for the result. In a multi-threaded program it could be worthwhile to save resources by relying more on the kernel to handle events and thus reduce the CPU burden. This can be done by setting CUDA device flags such as cudaDeviceScheduleBlockingSync. In the case of cuML it is not necessarily important.
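For anyone who wants to experiment with this, here is a minimal sketch of setting that flag from Python via the cuda-python bindings. It assumes the cuda-python package is installed and that the flag is set before cuDF/cuML create a CUDA context; it is an illustration of the scheduling flag, not something cuML does for you.

# Must run before cuDF/cuML initialize a CUDA context.
from cuda import cudart

# Ask the runtime to block the calling CPU thread on synchronization
# instead of spin-waiting (busy polling) for results.
err, = cudart.cudaSetDeviceFlags(cudart.cudaDeviceScheduleBlockingSync)
assert err == cudart.cudaError_t.cudaSuccess

import cudf
from cuml import DBSCAN as cumlDBSCAN
# ... build X_gpu and call cumlDBSCAN(...).fit_predict(X_gpu) as above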
DBSCAN work with the CUDA flag change (computed by difference):
real = 2.731
user = 0.802
sys = 0.495
As you can see, the flag reduces the burden on the CPU (user time drops from 2.634 s to 0.802 s) while the real time is essentially unchanged. So, all in all, DBSCAN does not appear to be actually CPU-bound.
The performance differences between cuSpatial and cuML would be interesting to investigate, though.