Flish Wang
I also hit this bug, and created an issue on the PyTorch side.
Some workarounds that may help:
- Decorate at least one **forward** function of the model with torch.compile **before** the Triton kernel is called (see the sketch below). The more compiled functions there are, the...
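For illustration, a minimal sketch of that first workaround, assuming a toy model (`MyModel` is a hypothetical name, not from the issue):

```python
import torch

class MyModel(torch.nn.Module):  # hypothetical toy model for illustration
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 16)

    @torch.compile  # compile this forward before any Triton kernel is launched
    def forward(self, x):
        return self.linear(x)

model = MyModel().cuda()
model(torch.randn(2, 16, device="cuda"))  # warm-up: triggers compilation first
# ...only after this should the code path that launches the Triton kernel run
```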
Parameters generally should not be changed directly in the forward pass. As a best practice, you may use self.register_buffer instead (see the sketch below). Anyway, if you really want to change the data in a...
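A minimal sketch of the buffer approach (the module and update rule are made up for illustration):

```python
import torch

class RunningStats(torch.nn.Module):  # hypothetical module for illustration
    def __init__(self, dim: int):
        super().__init__()
        # buffers live in state_dict and follow .to()/.cuda(),
        # but parameters() never returns them, so the optimizer is untouched
        self.register_buffer("running_mean", torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # updating a buffer in-place during forward is fine; mutating an
        # nn.Parameter here would fight autograd and the optimizer
        with torch.no_grad():
            self.running_mean.mul_(0.9).add_(x.mean(dim=0), alpha=0.1)
        return x - self.running_mean
```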
> Is what you're looking for `a=A().to(memory_format=torch.channels_last_3d)`

Nope. The model is for 2D images, and the 5D param is something like tokens/masks/attention biases/position embeddings.
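To make that concrete, a hypothetical sketch of such a 5D parameter in a 2D image model (shapes and names invented):

```python
import torch

class Model2D(torch.nn.Module):  # hypothetical 2D image model
    def __init__(self, groups: int = 4, heads: int = 8, tokens: int = 196):
        super().__init__()
        # a 5D learnable tensor, e.g. a per-group attention bias of shape
        # (groups, heads, tokens, tokens, 1): it is not a 3D-conv activation,
        # so the channels_last_3d memory format does not apply to it
        self.attn_bias = torch.nn.Parameter(
            torch.zeros(groups, heads, tokens, tokens, 1)
        )
```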
> Root cause: torch.compile may be incompatible with torch.cuda.is_current_stream_capturing(). Ref: https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/deepseek_v2.py#L715
>
> Success: `python3 -m sglang.launch_server --model /DeepSeek-V3 --tp 8 --trust-remote-code --mem-fraction-static 0.7 --cuda-graph-max-bs 16`
>
> Failed: `python3 -m sglang.launch_server...
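If that diagnosis is right, one possible workaround is to keep the capture check out of the compiled region with `torch.compiler.disable`, so it is evaluated at runtime instead of being traced. This is only a sketch of the idea, not sglang's actual fix:

```python
import torch

@torch.compiler.disable  # graph-break: evaluate the capture state eagerly
def is_capturing() -> bool:
    return torch.cuda.is_current_stream_capturing()

@torch.compile
def scaled(x: torch.Tensor) -> torch.Tensor:
    # branch on the live capture state rather than a traced-in constant
    if is_capturing():
        return x * 2.0      # CUDA-graph-safe path (illustrative)
    return x * 2.0 + 0.0    # regular eager path (illustrative)
```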
> I have another question: computing the reward seems to run serially, during which GPU utilization is 0, and waiting for the genRM to score takes a long time. Is there a way, or an example, of doing this asynchronously, e.g. scoring each rollout with the genRM as it finishes, instead of waiting for all rollouts to complete before scoring?

The async rollout server in the latest verl seems to be able to compute_reward right after each rollout finishes. Maybe you can call the genRM server during compute_reward...
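A minimal sketch of that overlap using a thread pool; `generate_rollout` and `genrm_score` are hypothetical stand-ins, not verl APIs:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def generate_rollout(prompt: str) -> str:
    time.sleep(0.1)                      # stand-in for actual generation
    return f"response to {prompt}"

def genrm_score(rollout: str) -> float:
    time.sleep(0.2)                      # stand-in for a genRM server call
    return float(len(rollout))

def rollout_and_score(prompts):
    # hand each finished rollout to the genRM immediately, so the
    # (I/O-bound) scoring call overlaps with the next generation
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(genrm_score, generate_rollout(p)) for p in prompts]
        return [f.result() for f in futures]

print(rollout_and_score(["q1", "q2", "q3"]))
```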