Perkz Zheng

Results 84 comments of Perkz Zheng

I have tried the original DLRM Criteo multi-GPU script in /examples; it still leads to UCX errors when setting -p "ucx".

Hi @duli2012, can you make sure you have the same ENV settings on all nodes? You can set `NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ENV` to generate the logs when you run the gpt_example, and...
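One way to check that the settings really match across nodes is to collect each node's environment and compare the `NCCL_*` variables. The sketch below is only an illustration: the function name `nccl_env_mismatches` and the node labels are hypothetical, and it assumes you have already gathered each node's environment into a dictionary (for example by running `env` on every node).

```python
def nccl_env_mismatches(node_envs):
    """Return {variable: {node: value}} for NCCL_* variables that differ.

    node_envs maps a node label (hypothetical, e.g. "node0") to that node's
    environment as a dict. A variable missing on a node is reported as None.
    """
    # Collect every NCCL_* variable name seen on any node.
    names = set()
    for env in node_envs.values():
        names.update(key for key in env if key.startswith("NCCL_"))

    mismatches = {}
    for name in sorted(names):
        values = {node: env.get(name) for node, env in node_envs.items()}
        # More than one distinct value means the nodes disagree.
        if len(set(values.values())) > 1:
            mismatches[name] = values
    return mismatches
```

An empty result means every `NCCL_*` variable has the same value (or is unset) on all nodes that were compared.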

Can you add NCCL_DEBUG=INFO and see if we can get more detailed logs?

> Can you tell me which version of torch, nccl, cuda, and cudnn I should use to check the operation of the main branch? It is recommended to use the...

@khj94 Can you try `NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,ENV,COLL,NET` for a more detailed log? Thanks.

> I am experimenting with smoothquant, and an error occurred during checkpoint conversion with the [command](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#smoothquant). > > When I download the llama2-7b model from huggingface and convert the checkpoint...

@khj94 Can you try `NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,ENV,COLL,NET`, and save the log as a txt file? Please make sure you set the correct ENV, or we may not see detailed logs....
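For anyone unsure how to capture such a log, here is a minimal sketch that launches a command with these variables set and writes its combined output to a text file. The helper name `run_with_nccl_debug` and the log path are hypothetical; only the `NCCL_DEBUG` and `NCCL_DEBUG_SUBSYS` settings come from the comment above, and the command you pass in is whatever you normally run.

```python
import os
import subprocess
import sys


def run_with_nccl_debug(cmd, log_path, subsys="INIT,ENV,COLL,NET"):
    """Run cmd with NCCL debug logging enabled and save its output to log_path.

    cmd is a list of arguments, as accepted by subprocess.run.
    Returns the exit code of the command.
    """
    # Copy the current environment so the parent process is left unchanged.
    env = dict(os.environ)
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_DEBUG_SUBSYS"] = subsys

    # Send stdout and stderr to the same file so the log stays in order.
    with open(log_path, "w") as log:
        result = subprocess.run(cmd, env=env, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode
```

A call could look like `run_with_nccl_debug([sys.executable, "your_script.py"], "nccl_log.txt")`, where `your_script.py` is a placeholder for the actual workload. In a multi-node run each launched process needs the same variables, which this single-process sketch does not handle.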

@khj94 Can you try the fix shown here? https://github.com/NVIDIA/TensorRT-LLM/issues/1131#issuecomment-1968641974

Please set NCCL_DEBUG=INFO, run the tests again, and see if we can get more detailed logs.

Did you set the ALGO explicitly? And could you try nccl-tests with the same environment?