opeide
I've also had this issue! It was very hard to track down with a DataLoader using multiple workers; I had to set num_workers=0 to even get a traceback pointing at where the code was hanging...
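A minimal sketch of what I mean (the dataset and batch size here are placeholders): with num_workers=0 the loading code runs in the main process, so the real traceback shows up instead of an opaque worker hang.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; substitute your own.
my_dataset = TensorDataset(torch.randn(32, 3, 64, 64),
                           torch.zeros(32, dtype=torch.long))

# num_workers=0 runs the dataset code in the main process, so any exception
# or hang points directly at the offending line instead of dying in a worker.
loader = DataLoader(my_dataset, batch_size=8, num_workers=0)

for images, labels in loader:
    pass  # your training step here
```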
For anyone hitting similar issues: I had trouble tracing until I added a torch.jit.is_tracing() check in Anchor's forward so that last_anchors is not used during tracing.
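Roughly what the guard looks like; this is a pared-down stand-in for the real Anchors module, not the actual implementation:

```python
import torch
import torch.nn as nn

class Anchors(nn.Module):
    """Simplified stand-in; the real module computes anchor boxes."""

    def __init__(self):
        super().__init__()
        self.last_anchors = {}  # device -> anchors cached from a previous call

    def forward(self, image):
        # Skip the Python-level cache while tracing, otherwise the trace
        # only records the cached branch and breaks on later inputs.
        if not torch.jit.is_tracing() and image.device in self.last_anchors:
            return self.last_anchors[image.device]

        anchors = torch.zeros(1, 100, 4, device=image.device)  # placeholder computation
        self.last_anchors[image.device] = anchors
        return anchors
```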
I had the same issue with DDP; in my case the culprit was torch.nn.SyncBatchNorm.convert_sync_batchnorm(model) (in my own code). I guess a BN layer ended up being the unused ("hidden") output in my model.
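For reference, the call in question (this is a toy model, and the "unused BN layer" explanation is just my guess); removing the conversion, or the layer that never contributed to the loss, resolved the hang for me:

```python
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, 3),
    nn.BatchNorm2d(8),   # if a BN layer never feeds into the loss,
    nn.ReLU(),           # the synced version can stall DDP's collective ops
)

# The line that caused the hang in my setup:
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```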
For me the issue was apparently in my augmentations. In albumentations there are some transforms that can loop infinitely, like RandomFog. I was only able to see where the code froze...
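Roughly how I narrowed it down (the exact transforms here are just examples): comment transforms out one at a time until the hang disappears, and run the pipeline on a dummy image outside the DataLoader where a freeze is easy to spot.

```python
import numpy as np
import albumentations as A

transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    # A.RandomFog(p=0.5),  # disabled while bisecting; this one froze for me
    A.RandomBrightnessContrast(p=0.5),
])

# Dummy image so the pipeline can be tested in isolation.
image = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
augmented = transform(image=image)["image"]
```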