Fred
Running into the same problem. I think it is hardware-independent. The code here uses the pytorch-lightning and NeMo frameworks. It happens after 8 hours of training. 
Thanks! Exactly what I need for debugging! Do you have a document for how to use this tool? The link in that PR isn't valid :(
Thanks!! Here is the log... I can't make sense of it. FYI, the code that I am having a problem with has been used on two systems, and both of them...
Thanks!! Since the error is in process 6, I am showing the log of nccl-rank-6 below:
```
pbg-dgx-1:1243335:1243335 [6] NCCL INFO cudaDriverVersion 12020
pbg-dgx-1:1243335:1243335 [6] NCCL INFO Bootstrap : Using...
```
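For reference, here is a minimal sketch of how per-rank NCCL logs like this can be captured, assuming the training script sets these environment variables before the process group / NCCL communicators are created (the file-name pattern is just an illustration):

```python
import os

# Ask NCCL for verbose diagnostics and write one log file per host/PID,
# so each rank's output (e.g. rank 6 above) can be inspected separately.
# Set these before torch.distributed / the Lightning Trainer initializes NCCL.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_FILE", "/tmp/nccl_%h_%p.log")  # %h = hostname, %p = PID
```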
I think the problem I have on nccl-rank-6 is just OOM, based on the log?
```
pbg-dgx-1:1243335:1244302 [6] include/alloc.h:178 NCCL WARN Cuda failure 'out of memory'
```
Yeah. I think it is because some of the GPUs are already heavily utilized, which triggers the OOM problem shown in the log. I only ran your program in...
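In case it's useful, a quick way to check whether the GPUs are already occupied before launching is something like the sketch below (assuming PyTorch is installed; the exact reporting format is just an example):

```python
import torch

# Report free/total memory on every visible GPU so an already-busy device
# can be spotted before training starts and triggers the NCCL OOM above.
for idx in range(torch.cuda.device_count()):
    free_bytes, total_bytes = torch.cuda.mem_get_info(idx)
    free_gib = free_bytes / 1024**3
    total_gib = total_bytes / 1024**3
    print(f"GPU {idx}: {free_gib:.1f} GiB free of {total_gib:.1f} GiB")
```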
Hi @amorehead, I tried to reach you by email but got no response. It's possible that my email ended up in your spam folder or that you have not...
Yep! Happy to help! I think the first thing is to figure out the exact features we want. I personally have a use case: I want to cluster all PDB...
Yes! I just want to raise the issue here so that other users can run your example.
Thank you so much for the information. Have you checked the quantitative metrics during training? I am wondering if this would be possible: the quantitative results converge while...