h2o4gpu

kmeans python iris test fails for multi-GPU.

Open teju85 opened this issue 7 years ago • 2 comments

Environment (for bugs)

  • OS platform, distribution and version (e.g. Linux Ubuntu 16.04): Linux Ubuntu 16.04
  • Installed from (source or binary): source
  • Version: @commitID abe45eeb717d5ab4fee2f5b1386c82c93f14ab33
  • Python version (optional): 3.5
  • CUDA/cuDNN version: cuda v9.0, cudnn v7.1, driver v384.125
  • GPU model (optional): Tesla V100 (from DGX1-Volta)
  • CPU model: Intel(R) Xeon(R) CPU E5-2698 v4
  • RAM available: 512GB

Please refer to google on how to obtain the above on your platform.

Description

make dotest fails in the kmeans tests on a multi-GPU system. The failing test is 'test_fit_iris', and only the multi-GPU case inside that test fails; the rest of the test passes.

Repro instructions

$ pytest -s --verbose --durations=10 -n 1 -vv --fulltrace --full-trace --junit-xml=build/test-reports/h2o4gpu-test.xml tests_open/kmeans 2>&1 | tee run.log

The run.log from this command is attached for your perusal.

Interestingly, the test passes if the multi-GPU case is run with n_gpus=2.
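One way to narrow this down is to sweep the GPU count and compare each result against the single-GPU baseline. The sketch below is only an illustration: it assumes the quantity worth comparing is the set of cluster centroids, matched up to cluster ordering, which may not be exactly what test_fit_iris asserts. The names centroids_match, sweep_gpu_counts, and fit_centroids are hypothetical helpers, not part of h2o4gpu.

```python
from itertools import permutations


def centroids_match(a, b, tol=1e-4):
    """Return True if centroid sets a and b agree up to cluster ordering.

    a and b are lists of centroids, each centroid a list of floats.
    """
    if len(a) != len(b):
        return False
    # Brute force over orderings of b; cheap for the 3 clusters of iris.
    for perm in permutations(b):
        if all(
            len(p) == len(q) and all(abs(x - y) <= tol for x, y in zip(p, q))
            for p, q in zip(a, perm)
        ):
            return True
    return False


def sweep_gpu_counts(fit_centroids, gpu_counts, tol=1e-4):
    """Compare each GPU count against the single-GPU baseline.

    fit_centroids is a hypothetical callable (an assumption of this sketch):
    it takes n_gpus and returns the fitted centroids as a list of lists.
    """
    baseline = fit_centroids(1)
    return {
        n: centroids_match(baseline, fit_centroids(n), tol)
        for n in gpu_counts
    }
```

For h2o4gpu, fit_centroids would wrap something like KMeans(n_clusters=3, n_gpus=n, random_state=...).fit(X).cluster_centers_, assuming the scikit-learn style constructor and attribute names; that has not been checked against this commit. A fixed seed matters here, since different initializations can legitimately converge to different centroids.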

teju85, May 31 '18 11:05

Thanks. @mdymczyk do you have any ideas? We obviously run this test ourselves, but on 2-GPU systems on Jenkins. Do you expect this test to actually pass (i.e. should 1 GPU agree with 4 GPUs)?

pseudotensor, Jun 01 '18 02:06

@pseudotensor not sure yet. It should pass on any number of GPUs, but maybe there's a bug somewhere; I need to look into it with a profiler. When discussing this with @teju85, he also mentioned the predictions are way off, so there might be a bug that our tests are not catching.

mdymczyk, Jun 01 '18 02:06