Fred
Running into the same problem. I think it is hardware-independent. The code here uses the pytorch-lightning and NeMo frameworks. It happens after 8 hours of training. 
Thanks! Exactly what I need for debugging! Do you have a document for how to use this tool? The link in that PR isn't valid :(
Thanks!! Here is the log... I can't make sense of it. FYI, the code that I am having a problem with has been used on two systems, and both of them...
Thanks!! Since the error is in process 6, I am showing the log of nccl-rank-6 below:
```
pbg-dgx-1:1243335:1243335 [6] NCCL INFO cudaDriverVersion 12020
pbg-dgx-1:1243335:1243335 [6] NCCL INFO Bootstrap : Using...
```
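For reference, here is a minimal sketch of how per-rank NCCL logs like this can be captured, assuming the training script sets these environment variables before the process group / NCCL communicators are created (the file-name pattern is just an illustration):

```python
import os

# Ask NCCL for verbose diagnostics and write one log file per host/PID,
# so each rank's output (e.g. rank 6 above) can be inspected separately.
# Set these before torch.distributed / the Lightning Trainer initializes NCCL.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_FILE", "/tmp/nccl_%h_%p.log")  # %h = hostname, %p = PID
```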
I think the problem I have on nccl-rank-6 is just OOM, based on the log?
```
pbg-dgx-1:1243335:1244302 [6] include/alloc.h:178 NCCL WARN Cuda failure 'out of memory'
```
Yeah. I think it is because some of the GPUs are already heavily utilized, which triggers the OOM problem shown in the log. I only ran your program in...
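In case it's useful, a quick way to check whether the GPUs are already occupied before launching is something like the sketch below (assuming PyTorch is installed; the exact reporting format is just an example):

```python
import torch

# Report free/total memory on every visible GPU so an already-busy device
# can be spotted before training starts and triggers the NCCL OOM above.
for idx in range(torch.cuda.device_count()):
    free_bytes, total_bytes = torch.cuda.mem_get_info(idx)
    free_gib = free_bytes / 1024**3
    total_gib = total_bytes / 1024**3
    print(f"GPU {idx}: {free_gib:.1f} GiB free of {total_gib:.1f} GiB")
```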
Hi @amorehead, I tried to reach you by email but got no response. It's possible that my email ended up in your spam folder or that you have not...
Yep! Happy to help! I think the first thing is to figure out the exact features we want. I personally have a use case: I want to cluster all PDB...
Yes! I just want to raise the issue here so that other users can run your example.
Thank you so much for the information. Have you checked the quantitative metrics during training? I am wondering if this would be possible: the quantitative results converge while...