Yuge Zhang comments

Results: 279 comments of Yuge Zhang

I think don't put compute_response_mask in trainer.py, otherwise response_mask may not be contiguous; put it in deamon.py?

What's your environment, and in what step does it go wrong? (training/validation/first step/after a few steps) Do you have multiple GPUs and multiple nodes?

I think don't put compute_response_mask in trainer.py, otherwise response_mask may not be contiguous; put it in deamon.py?

> torch.AcceleratorError: CUDA error: an illegal memory access was encountered
This looks like a GPU OOM error to me.
> In the near future I plan to try to implement...

I think don't put compute_response_mask in trainer.py, otherwise response_mask may not be contiguous; put it in deamon.py?

> verl canceled chat_completion design in latest version which is very inconvenient
Surprised to know. Seems that we need to figure out a plan. Either using verl in a different...

Fix training metrics before and after processing

Please merge from main as there are CI updates.

Fix training metrics before and after processing

/ci

Synthetic Monitoring Gaps

Close as no follow-ups.

Added the README and script files for training sql_agent on NPU

@xiaochulaoban please review the changes to make sure I didn't modify anything by mistake.

Added the README and script files for training sql_agent on NPU

/ci

Added the README and script files for training sql_agent on NPU

> Can it be merged into the main repository now
I'll take that as a "no problem". Please open another issue/PR if you have further questions. Thanks.

qwen3 moe lora error

I think some effort is needed to support the latest verl, as they moved the vllm inference server to the agent loop. Will look into it.