PyPOTS
Run big models with DDP/FSDP instead of `torch.nn.DataParallel`
1. Feature description
Make PyPOTS run models on multi-GPU with DDP (Distributed Data Parallel, https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) or FSDP (Fully Sharded Data Parallel, https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html).
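To illustrate the requested feature, below is a minimal single-process sketch of how DDP wraps a PyTorch model so that gradients are synchronized across ranks. This is not PyPOTS code: the model, data, and the single-rank `gloo` setup are illustrative assumptions; in real multi-GPU runs `torchrun` supplies the rank/world-size environment and the `nccl` backend would be used.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process demo on CPU with the gloo backend; under torchrun the
# MASTER_ADDR/MASTER_PORT, rank, and world_size come from the launcher.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 1)          # stand-in for a PyPOTS model
ddp_model = DDP(model)                  # gradients are all-reduced on backward()

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
x, y = torch.randn(16, 8), torch.randn(16, 1)

loss_before = torch.nn.functional.mse_loss(ddp_model(x), y)
for _ in range(20):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(ddp_model(x), y)
    loss.backward()                     # DDP hooks synchronize gradients here
    optimizer.step()
loss_after = torch.nn.functional.mse_loss(ddp_model(x), y)

dist.destroy_process_group()
```

Unlike `torch.nn.DataParallel`, which replicates the model inside one process each forward pass, DDP runs one process per GPU and only communicates gradients, which is why it scales better and is the prerequisite for FSDP's parameter sharding.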
2. Motivation
The current multi-GPU training in the PyPOTS framework, implemented with `torch.nn.DataParallel`, is not sufficient for training big models like Time-LLM (e.g. #675, where Time-LLM easily runs out of memory even on short-length TS samples). We need a more advanced mechanism like DDP or FSDP.
3. Your contribution
I'd like to lead or coordinate the development of this feature. Please leave comments below to start a discussion if you're interested; more comments will help prioritize this feature.