NeuralClothSim

How to use ray with multi-gpus?

Open fyyakaxyy opened this issue 6 months ago • 0 comments

Hi, when I use Ray with 4 GPUs, training fails with the following error:

Traceback (most recent call last):
  File "/home/bonesimulation/ncs/train.py", line 87, in <module>
    main(config)
  File "/home/bonesimulation/ncs/train.py", line 50, in main
    model.fit(
  File "/home/miniconda3/envs/ncs3/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/miniconda3/envs/ncs3/lib/python3.10/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/bonesimulation/ncs/model/ncs.py", line 166, in test_step
    body, garment_rot6d, garment_rot6d_gt, garment, garment_gt = self(inputs, training=False)
  File "/home/bonesimulation/ncs/model/ncs.py", line 213, in call
    X, matrices, garment_rot6d_gt, rotations_back = self.call_inputs(poses, trans)
  File "/home/bonesimulation/ncs/model/ncs.py", line 250, in call_inputs
    rotations_used = broadcast_rotation(rotations_used, m_used, self.body.input_joints)
  File "/home/bonesimulation/ncs/utils/rotation.py", line 165, in broadcast_rotation
    target = tf.tensor_scatter_nd_update(target, index_list, source)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Exception encountered when calling layer "ncs" (type NCS).

{{function_node __wrapped__TensorScatterUpdate_device_/job:localhost/replica:0/task:0/device:GPU:3}} Indices and updates specified for empty output shape [Op:TensorScatterUpdate]

Call arguments received by layer "ncs" (type NCS):
  • inputs=('tf.Tensor(shape=(0, 18, 84, 4), dtype=float32)', 'tf.Tensor(shape=(0, 18, 3), dtype=float32)')
  • w=None
  • training=False
(<lambda> pid=515961) 2024-08-22 11:28:14.178950: E tensorflow/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected [repeated 15x across cluster]
(<lambda> pid=515961) 2024-08-22 11:28:14.179034: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: gzailab-fuyiyu-bonesimulation-0 [repeated 15x across cluster]
(<lambda> pid=515961) 2024-08-22 11:28:14.179061: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: gzailab-fuyiyu-bonesimulation-0 [repeated 15x across cluster]
(<lambda> pid=515961) 2024-08-22 11:28:14.179226: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 550.67.0 [repeated 15x across cluster]
(<lambda> pid=515961) 2024-08-22 11:28:14.179277: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 550.67.0 [repeated 15x across cluster]
(<lambda> pid=515961) 2024-08-22 11:28:14.179293: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 550.67.0 [repeated 15x across cluster]
(<lambda> pid=515961) 2024-08-22 11:28:14.179903: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA [repeated 14x across cluster]
(<lambda> pid=515961) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. [repeated 14x across cluster]
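
Note the batch dimension of 0 in the reported input shapes (`(0, 18, 84, 4)` and `(0, 18, 3)`): the failing worker received an empty batch, and `tf.tensor_scatter_nd_update` rejects an empty target tensor. The error can be reproduced in isolation; a minimal sketch, assuming TensorFlow 2.x:

```python
import tensorflow as tf

# Scattering into a tensor whose leading (batch) dimension is 0 raises the
# same "empty output shape" InvalidArgumentError as in the traceback above.
target = tf.zeros([0, 3])        # empty batch, like the shape (0, 18, 84, 4)
indices = tf.constant([[0]])
updates = tf.ones([1, 3])

reproduced = False
try:
    tf.tensor_scatter_nd_update(target, indices, updates)
except tf.errors.InvalidArgumentError as e:
    reproduced = "empty output shape" in str(e)

print("reproduced:", reproduced)
```

This suggests the dataset is being sharded across the 4 workers in a way that leaves some shards empty, so the model's `call` is invoked on a zero-sized batch.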

However, it only works with a single GPU. How can I solve this?
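
The repeated `CUDA_ERROR_NO_DEVICE` lines also suggest the Ray worker processes cannot see any GPU: Ray sets `CUDA_VISIBLE_DEVICES` inside each worker from the `num_gpus` the task requested, and a task declared with the default `num_gpus=0` sees an empty list. A small helper to check what a worker actually sees (the helper name and the `train_one_replica` entry point are hypothetical, not from the NCS code):

```python
import os

def visible_gpus():
    """Return the GPU ids visible to the current process.

    Ray rewrites CUDA_VISIBLE_DEVICES per worker, so calling this inside a
    remote task shows which GPUs (if any) that worker was granted.
    """
    ids = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [int(i) for i in ids.split(",") if i != ""]

# Sketch of how a training task could request a GPU so each worker
# is actually scheduled onto one (assumption, not the NCS launch code):
#
#   import ray
#   ray.init(num_gpus=4)
#
#   @ray.remote(num_gpus=1)
#   def train_one_replica(shard):
#       print(visible_gpus())   # should be a single id, e.g. [2]
#       ...
```

If `visible_gpus()` returns `[]` inside a worker, TensorFlow will fall back to CPU and log exactly the `cuInit: CUDA_ERROR_NO_DEVICE` message seen above.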

fyyakaxyy avatar Aug 23 '24 02:08 fyyakaxyy