
B200 docker image support

Open · dtl123456 opened this issue 1 month ago • 1 comment

When using the verlai/verl:app-verl0.5-transformers4.55.4-vllm0.10.0-mcore0.13.0-te2.2 image, an error occurs on the B200 server, but this issue does not exist on the H200 server. Is there currently a Docker image available for B200?

detailed error:
self.word_embeddings = tensor_parallel.VocabParallelEmbedding(
(TaskRunner pid=39975) File "/usr/local/lib/python3.10/dist-packages/megatron/core/tensor_parallel/layers.py", line 259, in __init__
(TaskRunner pid=39975) _initialize_affine_weight_gpu(self.weight, init_method, partition_dim=0, stride=1)
(TaskRunner pid=39975) File "/usr/local/lib/python3.10/dist-packages/megatron/core/tensor_parallel/layers.py", line 136, in _initialize_affine_weight_gpu
(TaskRunner pid=39975) init_method(weight)
(TaskRunner pid=39975) File "/usr/local/lib/python3.10/dist-packages/torch/nn/init.py", line 193, in normal_
(TaskRunner pid=39975) return _no_grad_normal_(tensor, mean, std, generator)
(TaskRunner pid=39975) File "/usr/local/lib/python3.10/dist-packages/torch/nn/init.py", line 22, in _no_grad_normal_
(TaskRunner pid=39975) return tensor.normal_(mean, std, generator=generator)
(TaskRunner pid=39975) RuntimeError: CUDA error: no kernel image is available for execution on the device
(TaskRunner pid=39975) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
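
For anyone triaging the same error: "no kernel image is available" usually means the installed binaries were not compiled for the GPU's compute capability (B200 reports 10.0, H200 reports 9.0). The traceback ends inside torch, so one possible check is whether the image's PyTorch build lists a matching architecture. The sketch below is a simplified version of CUDA's compatibility rules and is not part of verl; the helper names `_parse_arch` and `is_capability_supported` are hypothetical, and whether this particular image lacks `sm_100` kernels is an assumption that has not been verified here.

```python
import re


def _parse_arch(arch):
    """Split an entry such as 'sm_90', 'sm_90a' or 'compute_100' into parts.

    Hypothetical helper. Returns (kind, (major, minor), suffix) or None.
    """
    m = re.fullmatch(r"(sm|compute)_(\d{2,})([a-z]?)", arch)
    if m is None:
        return None
    kind, digits, suffix = m.groups()
    return kind, (int(digits[:-1]), int(digits[-1])), suffix


def is_capability_supported(capability, arch_list):
    """Simplified check: can a build compiled for arch_list run on this GPU?

    Assumed rules:
      - sm_XY binaries run on devices with the same major and minor >= Y
      - compute_XY PTX can be JIT-compiled for any capability >= X.Y
      - suffixed variants (e.g. sm_90a) are treated as exact-match only
    """
    capability = tuple(capability)
    for arch in arch_list:
        parsed = _parse_arch(arch)
        if parsed is None:
            continue  # ignore entries this sketch does not understand
        kind, arch_cap, suffix = parsed
        if suffix:
            if arch_cap == capability:
                return True
            continue
        if kind == "sm":
            if arch_cap[0] == capability[0] and arch_cap[1] <= capability[1]:
                return True
        elif arch_cap <= capability:
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print("device capability:", cap)
        print("torch arch list:", archs)
        print("supported:", is_capability_supported(cap, archs))
```

Running this inside the container on both servers would show whether the arch list covers the B200. A `False` result on the B200 would be consistent with the error above, although other compiled extensions in the image could also be missing kernels for that architecture.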

dtl123456 · Oct 31 '25 06:10

Same question here.

Ariya12138 · Nov 06 '25 09:11