
Large transfer time in k-means with a large number of clusters

PlayMen opened this issue 7 years ago · 9 comments

Hi, I ran the benchmark from https://github.com/h2oai/h2o4gpu/blob/master/presentations/benchmarks.pdf with 100 iterations and different numbers of clusters. I see the following:

| Timetransfer (s) | Fit on GPU time, H2O (s) |
|---|---|
| 1.38588 | 9.23817 |
| 1.50101 | 12.5744 |
| 8.79409 | 20.1906 |
| 193.576 | 10.7283 |

For 100 clusters the fit time is even lower (although all 100 iterations completed in every case), but Timetransfer is very significant. It wasn't this big when I ran on a Tesla P100 with CUDA 8.0. I used the nccl build; maybe I need the nonccl one?

Could you provide any advice on how to reduce the data transfer time in this case?

OS: Linux
Installed from: https://s3.amazonaws.com/h2o-release/h2o4gpu/releases/bleeding-edge/ai/h2o/h2o4gpu/0.2-nccl-cuda9/h2o4gpu-0.2.0-cp36-cp36m-linux_x86_64.whl
Python: 3.6
CUDA: 9
GPU: 1x Tesla V100
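For anyone wanting to reproduce the timing, a minimal sketch of such a loop is below. Assumptions: h2o4gpu exposes a scikit-learn-compatible `KMeans`, so the same script should run with either import; the data shape and parameters here are illustrative and not taken from benchmarks.pdf.

```python
# Timing sketch: fit k-means for increasing cluster counts and report
# wall-clock fit time. With the h2o4gpu import, the library's verbose
# output additionally reports Timetransfer/Timefit separately.
import time
import numpy as np
from sklearn.cluster import KMeans  # swap for: from h2o4gpu import KMeans

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 16)).astype(np.float32)  # illustrative shape

for n_clusters in (100, 200, 1000):
    model = KMeans(n_clusters=n_clusters, max_iter=100, n_init=1)
    t0 = time.perf_counter()
    model.fit(X)
    elapsed = time.perf_counter() - t0
    print(f"{n_clusters:>5} clusters: fit {elapsed:.3f}s")
```

With a GPU build, comparing this wall-clock time against the reported Timefit isolates how much of the total is transfer overhead.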

PlayMen avatar Jan 22 '18 12:01 PlayMen

Hi. It's possible this is a CUDA 9 bug. We have seen cases where simple API calls lead to blocking multi-GPU behavior, e.g. in the xgboost project: https://github.com/dmlc/xgboost/commit/4d36036fe6fdc30ba4d72c84f3957a39a29ab23f

@mdymczyk Would you have time to look into this?

pseudotensor avatar Jan 22 '18 17:01 pseudotensor

Sure, I'll have a look early next week when I'm done with my current tasks.

@PlayMen thank you for the benchmarks! Will keep you updated.

mdymczyk avatar Jan 23 '18 13:01 mdymczyk

@mdymczyk Thanks for your reply! Waiting for updates

PlayMen avatar Jan 23 '18 15:01 PlayMen

I see the same issue with the CUDA 8 nccl wheel too :( Could you also clarify whether "Timetransfer" in the verbose output includes both device-to-host and host-to-device transfer time?

PlayMen avatar Jan 23 '18 17:01 PlayMen

@PlayMen it's only host -> GPU.

mdymczyk avatar Jan 29 '18 01:01 mdymczyk

@mdymczyk

I tried the 0.2.0 stable version (CUDA9 nccl, https://s3.amazonaws.com/h2o-release/h2o4gpu/releases/stable/ai/h2o/h2o4gpu/0.2-nccl-cuda9/h2o4gpu-0.2.0-cp36-cp36m-linux_x86_64.whl) for k-means. The situation is better with 100 clusters, 100 iterations: Timetransfer: 1.4957, Timefit: 1.2562. However, as the number of clusters increases, transfer time increases dramatically:

200 clusters: Timetransfer: 10.3801 Timefit: 2.16476

1000 clusters: Timetransfer: 236.411 Timefit: 7.69439

Why is there such a large difference in transfer time? The centroid table, which is the only input that depends on the number of clusters, is not large enough to affect the result significantly.
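The point about the centroid table can be checked with back-of-the-envelope arithmetic. Assumptions in this sketch: the dimensionality `d = 64`, float32 storage, and the ~12 GB/s effective PCIe 3.0 x16 bandwidth are illustrative numbers, not taken from the benchmark.

```python
# Estimate the size of the centroid table and how long shipping it
# over PCIe should take, for the cluster counts from the thread.
def centroid_bytes(n_clusters, d, itemsize=4):
    """Bytes needed for an n_clusters x d float32 centroid table."""
    return n_clusters * d * itemsize

PCIE_BYTES_PER_SEC = 12e9  # rough effective PCIe 3.0 x16 bandwidth

for k in (100, 200, 1000):
    size = centroid_bytes(k, d=64)
    est_ms = size / PCIE_BYTES_PER_SEC * 1e3
    print(f"{k:>5} clusters: {size / 1024:.1f} KiB, ~{est_ms:.5f} ms to transfer")
```

Even at 1000 clusters the table is only a few hundred KiB, i.e. well under a millisecond of raw transfer, so a 236 s Timetransfer cannot be explained by data volume alone.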

OlegKremnyov avatar Apr 11 '18 12:04 OlegKremnyov

Any chance of getting this fixed?

OlegKremnyov avatar Jul 30 '18 08:07 OlegKremnyov

@OlegKremnyov let me double check if that is still the case with the current bleeding edge version. Can you share the setup you ran your tests on?

@trivialfis have you maybe noticed similar issues (large transfer times)?

mdymczyk avatar Jul 31 '18 02:07 mdymczyk

Truly sorry, I missed this one in my mail. I don't have a theory for this yet.

trivialfis avatar Aug 14 '18 08:08 trivialfis