B200 Docker image support
When using the verlai/verl:app-verl0.5-transformers4.55.4-vllm0.10.0-mcore0.13.0-te2.2 image, the following error occurs on a B200 server but not on an H200 server. Is there currently a Docker image available that supports B200?
Detailed error:
self.word_embeddings = tensor_parallel.VocabParallelEmbedding(
(TaskRunner pid=39975) File "/usr/local/lib/python3.10/dist-packages/megatron/core/tensor_parallel/layers.py", line 259, in __init__
(TaskRunner pid=39975) _initialize_affine_weight_gpu(self.weight, init_method, partition_dim=0, stride=1)
(TaskRunner pid=39975) File "/usr/local/lib/python3.10/dist-packages/megatron/core/tensor_parallel/layers.py", line 136, in _initialize_affine_weight_gpu
(TaskRunner pid=39975) init_method(weight)
(TaskRunner pid=39975) File "/usr/local/lib/python3.10/dist-packages/torch/nn/init.py", line 193, in normal_
(TaskRunner pid=39975) return _no_grad_normal_(tensor, mean, std, generator)
(TaskRunner pid=39975) File "/usr/local/lib/python3.10/dist-packages/torch/nn/init.py", line 22, in _no_grad_normal_
(TaskRunner pid=39975) return tensor.normal_(mean, std, generator=generator)
(TaskRunner pid=39975) RuntimeError: CUDA error: no kernel image is available for execution on the device
(TaskRunner pid=39975) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
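For context: "no kernel image is available for execution on the device" typically means the PyTorch build (or a CUDA extension) inside the image was not compiled for the B200's compute capability (sm_100, Blackwell). A minimal diagnostic sketch you can run inside the container, assuming only that PyTorch is installed and a GPU is visible:

```python
# Check whether the installed PyTorch build ships kernels for the visible GPU.
import torch

# Compute capability of the GPU, e.g. (10, 0) on a B200.
major, minor = torch.cuda.get_device_capability(0)
device_arch = f"sm_{major}{minor}"

# Architectures this PyTorch wheel was compiled for, e.g. ['sm_80', 'sm_90', ...].
compiled_archs = torch.cuda.get_arch_list()

print("device arch:   ", device_arch)
print("compiled archs:", compiled_archs)

if device_arch not in compiled_archs:
    print("This PyTorch build has no kernels for this GPU, which would "
          "explain the 'no kernel image is available' error.")
```

Note that the other CUDA components in the image (Megatron-Core fused kernels, TransformerEngine, vLLM) would likely also need to be built with sm_100 support for the B200.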
Same question here.