Unable to train with DDP strategy

Open · Bhathiya-hw opened this issue 1 year ago · 1 comment

I am trying to train edge representations on a machine with 4 GPUs. However, the training process hangs after the validation sanity check. Training works well in a single-GPU setting (and also with accelerator: 'dp', even though this is not the intended way of training).

Note: Debugging suggests that the model gets stuck when accessing a registered buffer at: https://github.com/allenai/CSR/blob/main/src/lightning/modules/moco2_module.py#L153 . However, I was unable to find a fix.
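
For context, here is a minimal sketch of the kind of code path described above. It is not the code at the linked line, which is not reproduced here. It assumes a MoCo-style module that keeps its key queue in registered buffers and gathers keys across ranks before writing to the queue. `QueueModule`, `gather_keys`, and `dequeue_and_enqueue` are hypothetical names chosen for illustration. One possible (unconfirmed) cause of a hang in this pattern is that `all_gather` is a collective call, so if any rank skips it, the other ranks block indefinitely.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F


class QueueModule(nn.Module):
    """Hypothetical MoCo-style key queue; NOT the CSR implementation."""

    def __init__(self, dim: int = 8, queue_size: int = 16):
        super().__init__()
        self.queue_size = queue_size
        # Buffers move with the module across devices but are not trained.
        self.register_buffer(
            "queue", F.normalize(torch.randn(dim, queue_size), dim=0)
        )
        self.register_buffer("queue_ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def gather_keys(self, keys: torch.Tensor) -> torch.Tensor:
        # Without an initialized process group (single GPU/CPU), skip the gather.
        if not (dist.is_available() and dist.is_initialized()):
            return keys
        # all_gather is a collective: every rank must reach this call,
        # otherwise the ranks that did reach it wait forever.
        gathered = [torch.zeros_like(keys) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, keys)
        return torch.cat(gathered, dim=0)

    @torch.no_grad()
    def dequeue_and_enqueue(self, keys: torch.Tensor) -> None:
        keys = self.gather_keys(keys)
        batch = keys.shape[0]
        # Simplifying assumption: the queue size is a multiple of the batch.
        assert self.queue_size % batch == 0
        ptr = int(self.queue_ptr)
        # Overwrite the oldest keys in place, then advance the pointer.
        self.queue[:, ptr:ptr + batch] = keys.T
        self.queue_ptr[0] = (ptr + batch) % self.queue_size
```

In a single process this runs without any process group, which matches the observation that single-GPU training works. Under DDP, a log statement immediately before and after the `all_gather` call on each rank would show whether all ranks actually reach it.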

Bhathiya-hw · Sep 20 '22 01:09