chongxiaoc


Did you enable the TensorBoard callback on all ranks? Does it still hang if you enable it on rank 0 only?
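The rank-0-only pattern suggested above can be sketched as follows. This is a minimal, standalone illustration of the gating logic: `rank` stands in for `hvd.rank()`, and the callbacks are placeholder strings so the snippet runs without TensorFlow or Horovod installed.

```python
# Sketch: enable the TensorBoard callback on rank 0 only, so the other
# ranks don't all write to (and contend on) the same log directory,
# which is a common cause of hangs in multi-rank training.

def build_callbacks(rank):
    # Placeholder for callbacks every rank needs (e.g. broadcast of
    # initial variables in a real Horovod program).
    callbacks = ["BroadcastGlobalVariablesCallback"]
    if rank == 0:
        # Only rank 0 logs to TensorBoard.
        callbacks.append("TensorBoard(log_dir='./logs')")
    return callbacks

print(build_callbacks(0))  # rank 0 gets the TensorBoard callback
print(build_callbacks(1))  # other ranks do not
```

In a real Keras/Horovod script, the same `if hvd.rank() == 0:` check would append the actual `tf.keras.callbacks.TensorBoard` instance before calling `model.fit(...)`.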

@yundai424 Can you hard-code the TensorBoard() parameters here https://github.com/horovod/horovod/blob/master/horovod/spark/keras/remote.py#L195 and see if it works for you? We have verified TF 2.2 cases before, but using a per-epoch update frequency.

Oh, that is a horovodrun example. Somehow I thought this ticket was reporting a Spark Keras Estimator issue.

@jianyuh I installed torch 1.11 + cu113 already.

Details:
```
>>> torch.__version__
'1.11.0+cu113'
>>> torch.cuda.is_available()
True
>>> import fbgemm_gpu
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.7/site-packages/fbgemm_gpu/__init__.py", line 12, in <module>
    torch.ops.load_library(os.path.join(os.path.dirname(__file__), "fbgemm_gpu_py.so"))
  File "/usr/lib/python3.7/site-packages/torch/_ops.py", ...
```

I tried installing from the source guide on the homepage. The build passes, for example:
```
[ 98%] Building CXX object bench/CMakeFiles/I8SpmdmBenchmark.dir/I8SpmdmBenchmark.cc.o
[ 99%] Building CXX object bench/CMakeFiles/I8SpmdmBenchmark.dir/BenchUtils.cc.o
[ 99%] ...
```

@colin2328 I think we can close it for now. Apparently the RTX 5000's compute capability 7.5 is not supported at the low level.

We have TF 2.7 in CI, for example: https://github.com/horovod/horovod/runs/8065886938?check_suite_focus=true But I don't think `experimental_run_tf_function` is explicitly set there.

Using n GPUs is equivalent to using an n× batch size on a single GPU. Assuming you have 1000 samples and batch_size = 10, then n_steps_per_epoch = 1000 / 10 = 100 for a single...
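The arithmetic above can be written out as a small standalone sketch. `n_gpus` stands in for `hvd.size()` here so the snippet runs without Horovod; the sample count and per-GPU batch size are the ones from the comment.

```python
# With n GPUs the effective (global) batch size is n * per_gpu_batch_size,
# so the number of steps per epoch shrinks by a factor of n.
n_samples = 1000
batch_size = 10          # per-GPU batch size, as in the comment above

def steps_per_epoch(n_gpus):
    global_batch = batch_size * n_gpus  # n GPUs ~ n-times batch on one GPU
    return n_samples // global_batch

print(steps_per_epoch(1))  # 100 steps on a single GPU
print(steps_per_epoch(4))  # 25 steps when 4 GPUs split the epoch
```

This is why distributed Keras scripts typically pass `steps_per_epoch = num_samples // (batch_size * hvd.size())` to `model.fit(...)`: each rank only walks its share of the epoch.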

Hi @Nafees-060 In the example above, the input is a list of files, so we use `load_data(path='mnist-%d.npz' % hvd.rank())` to assign different files to different GPUs (using the Horovod rank). How to...
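The per-rank file-selection idiom from this comment can be sketched standalone. Here `rank` stands in for `hvd.rank()` and the loop simulates a 4-GPU job; in a real Horovod program each process would compute only its own path.

```python
# Each Horovod rank loads its own shard file: rank 0 reads mnist-0.npz,
# rank 1 reads mnist-1.npz, and so on, so the data is split across GPUs
# without any two ranks reading the same file.

def shard_path(rank):
    return 'mnist-%d.npz' % rank  # same formatting as in the comment

# Simulate a 4-GPU job: each r plays the role of hvd.rank() on one process.
paths = [shard_path(r) for r in range(4)]
print(paths)  # ['mnist-0.npz', 'mnist-1.npz', 'mnist-2.npz', 'mnist-3.npz']
```

When the data is a single array rather than pre-split files, the analogous trick is index striding, e.g. `data[hvd.rank()::hvd.size()]`, so each rank still sees a disjoint slice.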