h2o4gpu
Large transfer time in k-means with a large number of clusters
Hi, I ran the benchmark from https://github.com/h2oai/h2o4gpu/blob/master/presentations/benchmarks.pdf with 100 iterations and different numbers of clusters. I see the following picture:
Timetransfer | Fit on GPU time, H2O, seconds
1.38588 | 9.23817
1.50101 | 12.5744
8.79409 | 20.1906
193.576 | 10.7283
For 100 clusters I actually see a lower fit time (although all 100 iterations were completed in every case), but Timetransfer is very significant. It wasn't this large when I ran on a Tesla P100 with CUDA 8.0. I used the nccl version; maybe I need the nonccl one?
Could you give any advice on how I can reduce the data transfer time in this case?
OS: Linux
Installed from: https://s3.amazonaws.com/h2o-release/h2o4gpu/releases/bleeding-edge/ai/h2o/h2o4gpu/0.2-nccl-cuda9/h2o4gpu-0.2.0-cp36-cp36m-linux_x86_64.whl
Python 3.6, CUDA 9
GPU: 1x Tesla V100
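Roughly, the benchmark run looks like the sketch below. The data shape is a synthetic placeholder rather than the real benchmark dataset, and I'm assuming the scikit-learn-style KMeans interface here:

```python
# Rough sketch of the benchmark run, not the exact script:
# synthetic placeholder data stands in for the real benchmark dataset,
# and the scikit-learn-style KMeans interface is assumed.
import time

import numpy as np
from h2o4gpu import KMeans

X = np.random.rand(1_000_000, 10).astype(np.float32)  # placeholder shape

model = KMeans(n_clusters=100, max_iter=100, verbose=1)

t0 = time.time()
model.fit(X)  # verbose output prints Timetransfer / Timefit
print("total fit() wall time: %.2f s" % (time.time() - t0))
```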
Hi. It's possible this is a CUDA 9 bug. We have seen cases where simple API calls lead to blocking multi-GPU behavior, e.g. in the xgboost project: https://github.com/dmlc/xgboost/commit/4d36036fe6fdc30ba4d72c84f3957a39a29ab23f
@mdymczyk Would you have time to look into this?
Sure, I'll have a look early next week when I'm done with my current tasks.
@PlayMen thank you for the benchmarks! Will keep you updated.
@mdymczyk Thanks for your reply! Waiting for updates
I see the same issue with the cuda8 nccl wheel too :( Could you also clarify whether "Timetransfer" in the verbose output includes both device-to-host and host-to-device transfer time?
@PlayMen it's only host -> GPU.
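If you want to sanity-check what the raw host -> GPU copy alone should cost for your data size, something like the following sketch (outside h2o4gpu, using cupy, with a placeholder shape) should give a baseline:

```python
# Standalone sanity check of the raw host -> device copy time (not part of
# h2o4gpu); replace the placeholder shape with your actual data dimensions.
import time

import numpy as np
import cupy as cp

X = np.random.rand(1_000_000, 10).astype(np.float32)  # placeholder shape

t0 = time.time()
d_X = cp.asarray(X)                # host -> GPU copy
cp.cuda.Stream.null.synchronize()  # wait until the copy actually finishes
print("raw host->GPU copy: %.4f s for %.1f MB"
      % (time.time() - t0, X.nbytes / 1e6))
```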
@mdymczyk
I tried the 0.2.0 stable version (CUDA9 nccl, https://s3.amazonaws.com/h2o-release/h2o4gpu/releases/stable/ai/h2o/h2o4gpu/0.2-nccl-cuda9/h2o4gpu-0.2.0-cp36-cp36m-linux_x86_64.whl) for k-means. The situation is better with 100 clusters and 100 iterations: Timetransfer: 1.4957, Timefit: 1.2562. However, as the number of clusters increases, the transfer time grows dramatically:
200 clusters: Timetransfer: 10.3801 Timefit: 2.16476
1000 clusters: Timetransfer: 236.411 Timefit: 7.69439
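These numbers come from the same kind of sketch as before, just looping over the cluster count (again with placeholder data and the assumed scikit-learn-style interface):

```python
# Benchmark sketch sweeping the number of clusters; placeholder data
# and assumed scikit-learn-style KMeans interface.
import numpy as np
from h2o4gpu import KMeans

X = np.random.rand(1_000_000, 10).astype(np.float32)  # placeholder shape

for k in (100, 200, 1000):
    KMeans(n_clusters=k, max_iter=100, verbose=1).fit(X)
    # verbose output reports Timetransfer / Timefit for each cluster count
```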
Why is there such a large difference in transfer time? The centroids table, whose size depends on the number of clusters, is not nearly large enough to affect the result that significantly.
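As a rough back-of-the-envelope check (the feature count below is an assumed placeholder), even at 1000 clusters the centroid table should transfer in microseconds:

```python
# Rough size/transfer estimate for the centroid table; n_features is an
# assumed placeholder and the bandwidth is a ballpark PCIe 3.0 x16 figure.
n_clusters, n_features, bytes_per_float = 1000, 10, 4
centroid_bytes = n_clusters * n_features * bytes_per_float   # 40 KB
est_seconds = centroid_bytes / 10e9                          # ~10 GB/s PCIe
print("centroid table: %d bytes, ~%.1e s to transfer" % (centroid_bytes, est_seconds))
# -> 40000 bytes, ~4.0e-06 s: nowhere near hundreds of seconds.
```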
Any chance this could be fixed?
@OlegKremnyov let me double check if that is still the case with the current bleeding edge version. Can you share the setup you ran your tests on?
@trivialfis have you maybe noticed similar issues (large time transfers)?
Truly sorry, I missed this one in my email. I don't have a theory for this yet.