
Using ignite with Megatron-style model-parallel PyTorch modules

Open g-karthik opened this issue 4 years ago • 7 comments

❓ Questions/Help/Support

This is a somewhat general question, but I'd love a detailed response. When wanting to go beyond standard data-parallel training towards hybrid data+model-parallel training (like Megatron-LM), what are some ignite abstractions to use and avoid?

@vfdev-5

g-karthik avatar Feb 26 '21 08:02 g-karthik

@g-karthik thanks for an interesting question! I haven't yet explored hybrid data+model-parallel training and would love to test it.

@sdesrozis any thoughts? @Nic-Ma have you tried that in MONAI?

vfdev-5 avatar Feb 26 '21 08:02 vfdev-5

Hi @vfdev-5 ,

MONAI has a model-parallel tutorial: https://github.com/Project-MONAI/research-contributions/tree/master/lamp-automated-model-parallelism But I don't think it's based on the ignite workflow.

Thanks.

Nic-Ma avatar Feb 26 '21 13:02 Nic-Ma

I haven't experimented with model-parallel training yet. I would be very pleased to explore this topic.

sdesrozis avatar Feb 26 '21 20:02 sdesrozis

My first thoughts, if we just consider model parallelism on 2 GPUs:

  • The engine is agnostic to devices.
  • x, y and y_pred live on different devices, so you can't use create_supervised_xxx because those helpers move all the data to a single device (see the custom train step sketch below).
  • Metrics should be fine because they only rely on the output of the update function. If you write your own update function, it should work.
  • auto_model from idist may not work, because DataParallel is used whenever multiple GPUs are detected.
  • I think checkpointing and the loggers should work, but I can't be 100% sure.

We should first test this before trying hybrid data+model parallelism.

@g-karthik could you explain how you plan to distribute your model and data in that case? Thanks in advance.
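As a rough, untested sketch of what I mean by a custom update function: a toy model split across two GPUs, where the train step moves activations across the device boundary itself and returns whatever the metrics need. The module shapes and the output dict are only placeholders, not a recommended API.

```python
import torch
import torch.nn as nn
from ignite.engine import Engine

# Toy two-way model split: first half on cuda:0, second half on cuda:1
part1 = nn.Sequential(nn.Linear(32, 64), nn.ReLU()).to("cuda:0")
part2 = nn.Sequential(nn.Linear(64, 10)).to("cuda:1")

optimizer = torch.optim.SGD(
    list(part1.parameters()) + list(part2.parameters()), lr=1e-3
)
criterion = nn.CrossEntropyLoss()

def train_step(engine, batch):
    part1.train()
    part2.train()
    x, y = batch
    x = x.to("cuda:0")            # inputs live on the first device
    y = y.to("cuda:1")            # targets live where the loss is computed
    optimizer.zero_grad()
    h = part1(x).to("cuda:1")     # move activations across the model-parallel boundary
    y_pred = part2(h)
    loss = criterion(y_pred, y)
    loss.backward()
    optimizer.step()
    # Metrics only see this output, so attach whatever they need here
    return {"y_pred": y_pred.detach(), "y": y, "loss": loss.item()}

trainer = Engine(train_step)
```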

sdesrozis avatar Feb 26 '21 21:02 sdesrozis

@sdesrozis take a look at https://www.deepspeed.ai/tutorials/pipeline/ and https://www.deepspeed.ai/tutorials/megatron/ and the example.

I think that, in addition to what @sdesrozis said, the ignite.distributed module won't be aware of the "topology": it implicitly assumes a data-parallel-only axis. In the worst case, this can lead to hangs while all-reducing metrics...
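To illustrate the issue with plain torch.distributed (assuming the default process group is already initialized): with a Megatron-style layout the metric reductions would have to run on a data-parallel subgroup rather than on the whole world, otherwise a collective issued on only part of the ranks can hang. The 4x2 layout below is just an example.

```python
import torch.distributed as dist

# Example: 8 ranks arranged as 4 data-parallel x 2 model-parallel groups.
MODEL_PARALLEL_SIZE = 2
world_size = dist.get_world_size()
rank = dist.get_rank()

# Ranks holding the same model shard across replicas form a data-parallel group.
# new_group must be called by every rank for every group, hence the full loop.
data_parallel_group = None
for i in range(MODEL_PARALLEL_SIZE):
    ranks = list(range(i, world_size, MODEL_PARALLEL_SIZE))
    group = dist.new_group(ranks)
    if rank in ranks:
        data_parallel_group = group

# Metric reductions should then target this group, e.g.
#   dist.all_reduce(tensor, group=data_parallel_group)
# instead of the default (world-wide) group that ignite.distributed uses today.
```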

vfdev-5 avatar Feb 26 '21 22:02 vfdev-5

@vfdev-5 That's exactly what I was thinking about the collective ops in metrics.

sdesrozis avatar Feb 27 '21 07:02 sdesrozis

@g-karthik @sdesrozis I'm working on making ignite.distributed aware of a particular data-parallel configuration. I'll soon push a draft PR with the new API and an example using DeepSpeed.

vfdev-5 avatar Mar 01 '21 08:03 vfdev-5