edgeai-mmdetection
                                
RuntimeError: NCCL communicator was aborted on rank 1
Thanks for your error report and we appreciate it a lot.
Checklist
- I have searched related issues but cannot get the expected help.
- I have read the FAQ documentation but cannot get the expected help.
- The bug has not been fixed in the latest version.
Describe the bug A clear and concise description of what the bug is.
Reproduction
- What command or script did you run?
./run_detection_train.sh
- Did you make any modifications on the code or config? Did you understand what you have modified? No.
- What dataset did you use?
My own dataset (similar to bdd100k), with about 112,000 images in the training set.
Thanks for your nice work. We have run into a problem and need your help. I started training with my own dataset. When training reaches the end of one epoch, the following error is reported (see the attachment for the full log):

We look forward to your reply! Thanks a lot!
I am not an expert in CUDA / NCCL. But please search a bit and see if you get a solution. For example, I think these threads may be useful:
https://stackoverflow.com/questions/69693950/error-some-nccl-operations-have-failed-or-timed-out
https://discuss.pytorch.org/t/runtimeerror-nccl-communicator-was-aborted/136630/2
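For anyone trying to narrow this down, here is a rough sketch of two generic things that are often tried for this kind of error: turning on NCCL logging and giving collectives a longer timeout. It is not a confirmed fix for this issue. The helper names `nccl_debug_env` and `init_nccl_with_timeout` are made up for illustration, and how a timeout would actually reach `init_process_group` inside this repository's training script is an assumption that is not shown here.

```python
import os
from datetime import timedelta


def nccl_debug_env(base_env=None):
    """Return a copy of the environment with NCCL logging turned on.

    Hypothetical helper: an existing NCCL_DEBUG value is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    # NCCL_DEBUG=INFO makes NCCL print setup and error details for each rank.
    env.setdefault("NCCL_DEBUG", "INFO")
    return env


def init_nccl_with_timeout(minutes=120):
    """Initialize the default process group with a longer collective timeout.

    Hypothetical helper. Assumes the launcher has already set RANK,
    WORLD_SIZE, MASTER_ADDR and MASTER_PORT, and that PyTorch is installed
    with NCCL support. Whether the timeout is enforced for NCCL can depend
    on the PyTorch version.
    """
    # Imported lazily so nccl_debug_env() works without PyTorch installed.
    import torch.distributed as dist

    dist.init_process_group(backend="nccl", timeout=timedelta(minutes=minutes))
```

For example, `subprocess.run(["./run_detection_train.sh"], env=nccl_debug_env())` would rerun the training script with the extra logging, which may show which rank stalls or fails first at the end of the epoch.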
@lilyswang Hello, I have the same error. Have you solved the problem?
@mathmanu I have tried the approach in your links, but it does not work. So sad!
Facing exactly the same problem...