Chevolier comments

Results: 16 comments of Chevolier

[QST] NVTabular function is not supported for this dtype: size

same issue

torchrec Build inference library and example server failure

Many thanks, I'll take a look at it!

> You could explore [TorchEasyRec](https://github.com/alibaba/TorchEasyRec) and its inference service [here](https://torcheasyrec.readthedocs.io/zh-cn/latest/usage/serving.html). TorchEasyRec has further enhanced performance optimizations for inference based on TorchRec.

torchrec Build inference library and example server failure

Still waiting for solutions ...

Qwen3-Next-80B-A3B-Thinking hangs during multi-node SFT training with Ray (NCCL timeout on InfiniBand)

I ran into a similar problem. My model is Qwen3-Coder-30B-A3B-Instruct, and I am running DPO training on 8xH100 GPUs. Training gets stuck at step 0 and reports an NCCL timeout.

Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1580, OpType=ALLREDUCE, NumelIn=466119168, NumelOut=466119168, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.
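
For anyone comparing their own logs against this one, the fields in the watchdog line can be pulled out with a few lines of Python. This is only a rough sketch: `parse_watchdog_line` is a hypothetical helper name, and the regex assumes the message has exactly the shape shown above.

```python
import re

# The watchdog message quoted above, copied verbatim.
LOG_LINE = (
    "Watchdog caught collective operation timeout: "
    "WorkNCCL(SeqNum=1580, OpType=ALLREDUCE, NumelIn=466119168, "
    "NumelOut=466119168, Timeout(ms)=600000) ran for 600004 milliseconds "
    "before timing out."
)


def parse_watchdog_line(line: str) -> dict:
    """Extract the key=value fields of a WorkNCCL(...) timeout message.

    Hypothetical helper: assumes fields are separated by ", " and that
    the message ends with "ran for <N> milliseconds".
    """
    match = re.search(r"WorkNCCL\((.*)\) ran for (\d+) milliseconds", line)
    if match is None:
        raise ValueError("not a WorkNCCL timeout line")
    # Split "Key=Value" pairs; values are kept as strings.
    fields = dict(item.split("=", 1) for item in match.group(1).split(", "))
    fields["ran_ms"] = match.group(2)
    return fields


fields = parse_watchdog_line(LOG_LINE)
timeout_minutes = int(fields["Timeout(ms)"]) / 60_000
print(fields["OpType"], "SeqNum", fields["SeqNum"])
print("elements in the collective:", int(fields["NumelIn"]))
print("configured timeout (minutes):", timeout_minutes)
```

Here the timeout works out to 10 minutes, and the all-reduce covers about 466 million elements, so the collective that hung is a large one.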

same problem

Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1580, OpType=ALLREDUCE, NumelIn=466119168, NumelOut=466119168, Timeout(ms)=600000) ran for 600004 milliseconds before timing out.

My problem seems to be an out-of-memory issue when saving the checkpoint, since saving needs to gather the model's weights into memory. To solve it, I set "stage3_gather_16bit_weights_on_model_save": false...
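
For reference, this is roughly where that key sits in a DeepSpeed config. It is a minimal sketch, not my full config: only the `stage3_gather_16bit_weights_on_model_save` value comes from my setup, `"stage": 3` is assumed from the key name, and everything else a real config needs (batch sizes, precision, and so on) is left out.

```python
import json

# Minimal DeepSpeed config fragment (assumption: ZeRO stage 3, as implied
# by the "stage3_" prefix of the key). Other required settings are omitted.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # Do not gather the full 16-bit weights when saving the model;
        # the checkpoint then keeps the ZeRO-sharded state instead.
        "stage3_gather_16bit_weights_on_model_save": False,
    },
}

# Serialize to the JSON form that DeepSpeed config files use.
config_json = json.dumps(ds_config, indent=2)
print(config_json)
```

With gathering turned off, the saved checkpoint stays sharded, so a consolidated copy of the weights has to be produced separately afterwards if you need one.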