
Some details about RVQ code

Open yangdongchao opened this issue 3 years ago • 6 comments

❓ Questions

Hi, while trying to reproduce the training code based on your released part, I ran into a problem with multi-GPU training: https://github.com/facebookresearch/encodec/blob/main/encodec/quantization/core_vq.py#L150 and https://github.com/facebookresearch/encodec/blob/main/encodec/quantization/core_vq.py#L168 cause DDP training to stall, because these calls make the GPUs wait on each other. After deleting these lines, the model trains fine with torch DDP. However, I don't know whether removing them affects performance. Can you give me some advice on whether these lines can be safely deleted?

yangdongchao avatar Nov 13 '22 08:11 yangdongchao
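For context, the two linked lines invoke torch.distributed broadcast routines that synchronize codebook state across workers. A plausible explanation for the hang (an assumption, not confirmed in this thread) is that distributed collectives must be entered by every rank in matching order, and a data-dependent guard or DDP's own gradient hooks can break that ordering. Below is a minimal, hypothetical sketch of the failure mode; `maybe_init_codebook` and `inited` are illustrative names, not the real encodec API:

```python
# Hypothetical sketch of the hazard, not the actual encodec code:
# torch.distributed collectives must be entered by every rank in the
# same order. If a data-dependent condition lets some ranks skip the
# broadcast while others reach it, the participating ranks block forever.
import torch
import torch.distributed as dist

def maybe_init_codebook(codebook: torch.Tensor, inited: bool) -> None:
    if inited:
        # Early return: if `inited` differs across ranks (or DDP has queued
        # its own bucketed all-reduce in between), the collective below is
        # mismatched and every rank that did reach it hangs.
        return
    # ... per-rank k-means initialization of `codebook` would go here ...
    dist.broadcast(codebook, src=0)  # every rank must reach this call
```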

Good point. We actually did not use DDP for training but rather custom distributed routines: we perform manual averaging of the gradients and the model buffers after the backward call, using the all-reduce operators provided by torch.distributed. See encodec/distrib.py, in particular sync_grad and sync_buffers.

adefossez avatar Nov 17 '22 16:11 adefossez
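For readers who want the same behavior without encodec's helpers, here is a minimal sketch of manual post-backward averaging, assuming an already-initialized torch.distributed process group; the real sync_grad and sync_buffers in encodec/distrib.py may differ in detail:

```python
# Minimal sketch of manual gradient/buffer averaging across ranks,
# assuming dist.init_process_group(...) has already been called.
import torch
import torch.distributed as dist

def sync_grads(params) -> None:
    """Average parameter gradients across all ranks after loss.backward()."""
    world_size = dist.get_world_size()
    for p in params:
        if p.grad is not None:
            dist.all_reduce(p.grad.data, op=dist.ReduceOp.SUM)
            p.grad.data /= world_size

def sync_buffers(module: torch.nn.Module) -> None:
    """Average floating-point model buffers across all ranks."""
    world_size = dist.get_world_size()
    for buf in module.buffers():
        if buf.dtype.is_floating_point:
            dist.all_reduce(buf.data, op=dist.ReduceOp.SUM)
            buf.data /= world_size
```

You would call `sync_grads(model.parameters())` between `loss.backward()` and `optimizer.step()`. Buffer syncing matters here presumably because the quantizer's EMA codebook statistics are registered as buffers rather than parameters, so gradient averaging alone would leave them unsynchronized.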

@yangdongchao did you succeed in training the model?

compressor1212 avatar Nov 18 '22 05:11 compressor1212

@yangdongchao did you succeed in training the model?

Yes, I succeeded in training the model.

yangdongchao avatar Nov 18 '22 10:11 yangdongchao

@yangdongchao can you share the code if possible?

compressor1212 avatar Nov 18 '22 13:11 compressor1212

@yangdongchao can you share the code? Thank you very much.

lizeyu519 avatar Feb 22 '23 07:02 lizeyu519