evkogs

Results: 6 comments by evkogs

+1, same issue here: PyTorch 2.4, CUDA 12.6, p4d.24xlarge

Hi @kohya-ss, it's @GrigoryEvko here. I used this PR on three 8×A100 nodes 2 months ago; it works fine and can be merged. I feel that for flux models training...

Hi @kohya-ss, it's @GrigoryEvko here. I used this PR on three 8×A100 nodes 2 months ago; it works fine and can be merged. I feel that for flux models training...

> We will release v1.1 model for SDXL by the end of September. Additionally, we are currently training PuLID for FLUX, which will be released once it is ready. We...

I see it mainly as a complementary addition to the existing torch.distributed.elastic functionality. Also, considering the numerous ways to launch a training job, the main functionality would be restoring all model...

> Thanks @wconstab ! 1) > How can we drop some members out of a communicator and add new ones when the scheduler replaces them (e.g. PyTorch ProcessGroupNCCL + nccl...