chongxiaoc


Did you enable the TensorBoard callback on all ranks? Does it still hang if you enable it on rank 0 only?
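The rank-0-only pattern suggested above can be sketched as follows. This is a minimal, standalone illustration of the gating logic: `rank` stands in for `hvd.rank()`, and the callbacks are placeholder strings so the snippet runs without TensorFlow or Horovod installed.

```python
# Sketch: enable the TensorBoard callback on rank 0 only, so the other
# ranks don't all write to (and contend on) the same log directory,
# which is a common cause of hangs in multi-rank training.

def build_callbacks(rank):
    # Placeholder for callbacks every rank needs (e.g. broadcast of
    # initial variables in a real Horovod program).
    callbacks = ["BroadcastGlobalVariablesCallback"]
    if rank == 0:
        # Only rank 0 logs to TensorBoard.
        callbacks.append("TensorBoard(log_dir='./logs')")
    return callbacks

print(build_callbacks(0))  # rank 0 gets the TensorBoard callback
print(build_callbacks(1))  # other ranks do not
```

In a real Keras/Horovod script, the same `if hvd.rank() == 0:` check would append the actual `tf.keras.callbacks.TensorBoard` instance before calling `model.fit(...)`.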

@yundai424 Can you hard-code the TensorBoard() parameters here https://github.com/horovod/horovod/blob/master/horovod/spark/keras/remote.py#L195 and see if it works for you? We have verified TF 2.2 cases before, but using a per-epoch update frequency.

Oh, that is a horovodrun example. Somehow I thought this ticket was reporting a Spark Keras Estimator issue.

@jianyuh I installed torch 1.11 + cu113 already.

Details:
```
>>> torch.__version__
'1.11.0+cu113'
>>> torch.cuda.is_available()
True
>>> import fbgemm_gpu
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.7/site-packages/fbgemm_gpu/__init__.py", line 12, in <module>
    torch.ops.load_library(os.path.join(os.path.dirname(__file__), "fbgemm_gpu_py.so"))
  File "/usr/lib/python3.7/site-packages/torch/_ops.py", ...
```

I tried installing from the source guide on the homepage. The build passes, for example:
```
[ 98%] Building CXX object bench/CMakeFiles/I8SpmdmBenchmark.dir/I8SpmdmBenchmark.cc.o
[ 99%] Building CXX object bench/CMakeFiles/I8SpmdmBenchmark.dir/BenchUtils.cc.o
[ 99%] ...
```

@colin2328 I think we can close it for now. Apparently the RTX 5000's compute capability 7.5 is not supported at the low level.

We have TF 2.7 in CI, for example: https://github.com/horovod/horovod/runs/8065886938?check_suite_focus=true But I don't think `experimental_run_tf_function` is explicitly set there.

Using n GPUs is equivalent to using an n× batch size on a single GPU. Assuming you have 1000 samples and batch_size = 10, then n_steps_per_epoch = 1000 / 10 = 100 for a single...
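The arithmetic above can be written out as a small standalone sketch. `n_gpus` stands in for `hvd.size()` here so the snippet runs without Horovod; the sample count and per-GPU batch size are the ones from the comment.

```python
# With n GPUs the effective (global) batch size is n * per_gpu_batch_size,
# so the number of steps per epoch shrinks by a factor of n.
n_samples = 1000
batch_size = 10          # per-GPU batch size, as in the comment above

def steps_per_epoch(n_gpus):
    global_batch = batch_size * n_gpus  # n GPUs ~ n-times batch on one GPU
    return n_samples // global_batch

print(steps_per_epoch(1))  # 100 steps on a single GPU
print(steps_per_epoch(4))  # 25 steps when 4 GPUs split the epoch
```

This is why distributed Keras scripts typically pass `steps_per_epoch = num_samples // (batch_size * hvd.size())` to `model.fit(...)`: each rank only walks its share of the epoch.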

Hi @Nafees-060 In the example above, the input is a list of files, so we use `load_data(path='mnist-%d.npz' % hvd.rank())` to assign different files to different GPUs (using the Horovod rank). How to...
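The per-rank file-selection idiom from this comment can be sketched standalone. Here `rank` stands in for `hvd.rank()` and the loop simulates a 4-GPU job; in a real Horovod program each process would compute only its own path.

```python
# Each Horovod rank loads its own shard file: rank 0 reads mnist-0.npz,
# rank 1 reads mnist-1.npz, and so on, so the data is split across GPUs
# without any two ranks reading the same file.

def shard_path(rank):
    return 'mnist-%d.npz' % rank  # same formatting as in the comment

# Simulate a 4-GPU job: each r plays the role of hvd.rank() on one process.
paths = [shard_path(r) for r in range(4)]
print(paths)  # ['mnist-0.npz', 'mnist-1.npz', 'mnist-2.npz', 'mnist-3.npz']
```

When the data is a single array rather than pre-split files, the analogous trick is index striding, e.g. `data[hvd.rank()::hvd.size()]`, so each rank still sees a disjoint slice.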