Chevolier
> You could explore [TorchEasyRec](https://github.com/alibaba/TorchEasyRec) and its inference service [here](https://torcheasyrec.readthedocs.io/zh-cn/latest/usage/serving.html). TorchEasyRec has further enhanced performance optimizations for inference based on TorchRec.

Many thanks, I'll take a look at it!
Still waiting for solutions ...
I ran into a similar problem. My model is Qwen3-Coder-30B-A3B-Instruct, and I am doing DPO training on 8xH100 GPUs. Training gets stuck at step 0 and then fails with an NCCL timeout.
My problem seems to be an out-of-memory issue when saving the checkpoint, since saving needs to gather the model's weights into memory. To solve it, I set "stage3_gather_16bit_weights_on_model_save": false...
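
For anyone landing here later, this is a rough sketch of where that flag sits in a DeepSpeed config. It assumes ZeRO stage 3 (implied by the `stage3_` prefix); the file name `ds_config.json` and the helper `write_config` are made up for illustration, and a real config would carry the rest of your settings (batch size, precision, etc.), which are not shown.

```python
import json

# Hypothetical minimal config: only the ZeRO-3 keys relevant to this thread.
# Merge these into your existing DeepSpeed config rather than using as-is.
ds_config = {
    "zero_optimization": {
        "stage": 3,  # assumption: ZeRO stage 3 is in use
        # Do not gather the full 16-bit weights onto one rank when saving,
        # which is the step that appeared to run out of memory above.
        "stage3_gather_16bit_weights_on_model_save": False,
    },
}


def write_config(path="ds_config.json"):
    """Write the config dict to a JSON file and return the path."""
    with open(path, "w") as f:
        json.dump(ds_config, f, indent=2)
    return path
```

With the flag off, the checkpoint is saved as sharded ZeRO state rather than a single consolidated weights file, so you will likely need to consolidate it afterwards (DeepSpeed ships a `zero_to_fp32.py` script for that) before loading the model elsewhere.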