Using ignite with Megatron-style model-parallel PyTorch modules
❓ Questions/Help/Support
This is a somewhat general question, but I'd love a detailed response. When wanting to go beyond standard data-parallel training towards hybrid data+model-parallel training (like Megatron-LM), what are some ignite abstractions to use and avoid?
@vfdev-5
@g-karthik thanks for an interesting question! I haven't yet explored hybrid data+model-parallel training and would love to test it.
@sdesrozis any thoughts ? @Nic-Ma have you tried that in MONAI ?
Hi @vfdev-5 ,
MONAI has a model-parallel tutorial: https://github.com/Project-MONAI/research-contributions/tree/master/lamp-automated-model-parallelism But I think it's not based on the ignite workflow.
Thanks.
I haven't yet experimented with model-parallel training. I would be very pleased to explore this topic.
My first thoughts, if we just consider model parallelism on 2 GPUs:
- engine is agnostic to devices
- `x`, `y` and `y_pred` are on different devices. You can't use `create_supervised_xxx` because the data is moved to a single device... you need to write your own update function (see the sketch after this list)
- metrics should be ok because they rely on the output of the update function. If you write your own update function, it should work.
- `auto_model` from `idist` would not work because if multiple GPUs are detected, `DataParallel` is used...
- I think that checkpointing and loggers should work, but I can't be 100% sure...
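As a rough sketch of what a hand-written update function could look like for a model split across 2 GPUs (assuming `cuda:0` and `cuda:1` are available; the toy model, loss and data loader here are placeholders, not part of any existing ignite helper):

```python
import torch
import torch.nn as nn
from ignite.engine import Engine
from ignite.metrics import RunningAverage


class TwoDeviceNet(nn.Module):
    """Toy model split by hand: first half on cuda:0, second half on cuda:1."""

    def __init__(self):
        super().__init__()
        self.part1 = nn.Linear(10, 10).to("cuda:0")
        self.part2 = nn.Linear(10, 1).to("cuda:1")

    def forward(self, x):
        h = torch.relu(self.part1(x.to("cuda:0")))
        return self.part2(h.to("cuda:1"))  # y_pred lives on cuda:1


model = TwoDeviceNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()


def train_step(engine, batch):
    model.train()
    optimizer.zero_grad()
    x, y = batch                              # move pieces explicitly, not the whole batch
    y_pred = model(x)
    loss = criterion(y_pred, y.to("cuda:1"))  # compare on the device holding y_pred
    loss.backward()
    optimizer.step()
    return loss.item()


trainer = Engine(train_step)

# Metrics consume whatever train_step returns, so they stay device-agnostic here
RunningAverage(output_transform=lambda loss: loss).attach(trainer, "running_loss")

# trainer.run(train_loader, max_epochs=5)
```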
We should first test this before trying hybrid data+model parallelism.
@g-karthik could you explain how you plan to distribute your model and data in that case? Thanks in advance.
@sdesrozis take a look at https://www.deepspeed.ai/tutorials/pipeline/ and https://www.deepspeed.ai/tutorials/megatron/ and the example.
I think in addition to what @sdesrozis said, the ignite.distributed module won't be aware of the "topology". It implicitly assumes a data-parallel-only axis. In the worst case, this can lead to hangs while all-reducing metrics...
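For illustration, a rough sketch (not an existing ignite API) of how one could restrict the reduction to the data-parallel axis with raw `torch.distributed`, assuming a Megatron-style layout where `rank = dp_rank * mp_size + mp_rank` and an NCCL backend:

```python
import torch
import torch.distributed as dist


def build_data_parallel_group(mp_size: int):
    """Return the process group containing all data-parallel replicas of the
    current rank (i.e. the ranks sharing the same model-parallel position)."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    dp_group = None
    for mp_rank in range(mp_size):
        ranks = list(range(mp_rank, world_size, mp_size))  # same mp slice, all dp replicas
        group = dist.new_group(ranks)  # new_group must be called by every rank
        if rank in ranks:
            dp_group = group
    return dp_group


def dp_average(value: float, dp_group) -> float:
    """All-reduce a scalar across the data-parallel group only."""
    t = torch.tensor([float(value)], device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM, group=dp_group)
    return (t / dist.get_world_size(group=dp_group)).item()
```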
@vfdev-5 That's exactly what I was thinking about the collective ops in metrics.
@g-karthik @sdesrozis I'm working on making ignite distributed aware of a particular data-parallel configuration. I'll push a draft PR soon with the new API and an example using DeepSpeed.