NeuralClothSim
How to use Ray with multiple GPUs?
Hi, when I use Ray with 4 GPUs, training fails with the following error:
Traceback (most recent call last):
  File "/home/bonesimulation/ncs/train.py", line 87, in <module>
    main(config)
  File "/home/bonesimulation/ncs/train.py", line 50, in main
    model.fit(
  File "/home/miniconda3/envs/ncs3/lib/python3.10/site-packages/keras/utils/traceback_utils.py", line 70, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/miniconda3/envs/ncs3/lib/python3.10/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/bonesimulation/ncs/model/ncs.py", line 166, in test_step
    body, garment_rot6d, garment_rot6d_gt, garment, garment_gt = self(inputs, training=False)
  File "/home/bonesimulation/ncs/model/ncs.py", line 213, in call
    X, matrices, garment_rot6d_gt, rotations_back = self.call_inputs(poses, trans)
  File "/home/bonesimulation/ncs/model/ncs.py", line 250, in call_inputs
    rotations_used = broadcast_rotation(rotations_used, m_used, self.body.input_joints)
  File "/home/bonesimulation/ncs/utils/rotation.py", line 165, in broadcast_rotation
    target = tf.tensor_scatter_nd_update(target, index_list, source)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Exception encountered when calling layer "ncs" " f"(type NCS).
{{function_node __wrapped__TensorScatterUpdate_device_/job:localhost/replica:0/task:0/device:GPU:3}} Indices and updates specified for empty output shape [Op:TensorScatterUpdate]
Call arguments received by layer "ncs" " f"(type NCS):
• inputs=('tf.Tensor(shape=(0, 18, 84, 4), dtype=float32)', 'tf.Tensor(shape=(0, 18, 3), dtype=float32)')
• w=None
• training=False
(<lambda> pid=515961) 2024-08-22 11:28:14.178950: E tensorflow/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected [repeated 15x across cluster]
(<lambda> pid=515961) 2024-08-22 11:28:14.179034: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: gzailab-fuyiyu-bonesimulation-0 [repeated 15x across cluster]
(<lambda> pid=515961) 2024-08-22 11:28:14.179061: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: gzailab-fuyiyu-bonesimulation-0 [repeated 15x across cluster]
(<lambda> pid=515961) 2024-08-22 11:28:14.179226: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 550.67.0 [repeated 15x across cluster]
(<lambda> pid=515961) 2024-08-22 11:28:14.179277: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 550.67.0 [repeated 15x across cluster]
(<lambda> pid=515961) 2024-08-22 11:28:14.179293: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 550.67.0 [repeated 15x across cluster]
(<lambda> pid=515961) 2024-08-22 11:28:14.179903: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA [repeated 14x across cluster]
(<lambda> pid=515961) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. [repeated 14x across cluster]
However, only one GPU works. How can I solve this?
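
For anyone trying to reproduce this, here is a small diagnostic sketch that could be run inside each worker process to see which GPUs it is allowed to use. It assumes the `CUDA_ERROR_NO_DEVICE` lines above come from workers whose `CUDA_VISIBLE_DEVICES` environment variable is empty, which I have not confirmed. The helper name `visible_gpu_ids` is hypothetical and is not part of this repository, Ray, or TensorFlow.

```python
import os


def visible_gpu_ids(env=None):
    """Return the GPU ids listed in CUDA_VISIBLE_DEVICES, or None if it is unset.

    Hypothetical helper for debugging only. `env` defaults to os.environ;
    passing a plain dict makes the function easy to check in isolation.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        # Variable not set: CUDA does not hide any GPU from this process.
        return None
    # An empty string yields an empty list, meaning every GPU is hidden.
    return [token.strip() for token in raw.split(",") if token.strip()]


if __name__ == "__main__":
    # Print the process id together with what this process is allowed to see.
    print(f"pid={os.getpid()} CUDA_VISIBLE_DEVICES -> {visible_gpu_ids()}")
```

An empty list printed from a worker would mean that process cannot see any GPU, which would be consistent with the `no CUDA-capable device is detected` messages, while `None` would mean the variable is not restricting anything.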