Multi-GPU training

Open eeewhe opened this issue 4 years ago • 5 comments

Hi, these examples are excellent. Can we train on multiple GPUs with K2/lhotse, just like with DDP in PyTorch?

eeewhe avatar Jan 19 '21 08:01 eeewhe

I don't believe we have examples of multi-GPU training yet, but AFAIK standard PyTorch mechanisms for multi-GPU training should work. If you try, let us know, and make a PR about it!

danpovey avatar Jan 19 '21 09:01 danpovey

There is a WIP pull-request about multi-GPU training: https://github.com/k2-fsa/snowfall/pull/71

csukuangfj avatar Jan 19 '21 09:01 csukuangfj

Thanks, I'll try it. @danpovey @csukuangfj

eeewhe avatar Jan 19 '21 09:01 eeewhe

BTW, when I find some spare time, I intend to finish that PR so that we can switch between single-GPU and multi-GPU training. I will also need to make sure that it does the right thing when aggregating values such as validation loss across the GPUs (I saw both processes return slightly different values) and when storing/loading checkpoints. If you have more time and can pick it up before I do, you're welcome to do it :)
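To illustrate the aggregation concern, here is a minimal sketch in plain Python. It assumes (this is not confirmed in the thread) that each process evaluates a different shard of the validation set, which would explain why the per-process values differ. The function name and the numbers are hypothetical. The idea is to combine the summed loss and the frame count from every process before dividing, rather than reporting any single process's average.

```python
from typing import List, Tuple


def aggregate_validation_loss(per_rank: List[Tuple[float, int]]) -> float:
    """Combine per-process validation statistics into one global average.

    per_rank holds one (summed_loss, num_frames) pair per process.
    Hypothetical helper for illustration; not part of snowfall.
    """
    total_loss = sum(loss for loss, _ in per_rank)
    total_frames = sum(frames for _, frames in per_rank)
    if total_frames == 0:
        raise ValueError("no frames to average over")
    return total_loss / total_frames


# Made-up statistics for two processes that saw different amounts of data.
per_rank = [(120.0, 100), (330.0, 300)]

# Each process on its own would report a different average.
local_averages = [loss / frames for loss, frames in per_rank]

# Summing numerators and denominators first gives a single consistent value.
global_average = aggregate_validation_loss(per_rank)
```

In an actual DDP run the two sums would typically be combined across processes with a collective such as `torch.distributed.all_reduce` before the division, so that every process logs the same number.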

pzelasko avatar Jan 19 '21 18:01 pzelasko

Fantastic!!

danpovey avatar Jan 20 '21 04:01 danpovey