PyPOTS
Run big models with DDP/FSDP instead of `torch.nn.DataParallel`
1. Feature description
Make PyPOTS run models on multi-GPU with DDP (Distributed Data Parallel, https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) or FSDP (Fully Sharded Data Parallel, https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html).
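To illustrate the requested feature, below is a minimal single-process sketch of how DDP wraps a PyTorch model so that gradients are synchronized across ranks. This is not PyPOTS code: the model, data, and the single-rank `gloo` setup are illustrative assumptions; in real multi-GPU runs `torchrun` supplies the rank/world-size environment and the `nccl` backend would be used.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process demo on CPU with the gloo backend; under torchrun the
# MASTER_ADDR/MASTER_PORT, rank, and world_size come from the launcher.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 1)          # stand-in for a PyPOTS model
ddp_model = DDP(model)                  # gradients are all-reduced on backward()

optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1)
x, y = torch.randn(16, 8), torch.randn(16, 1)

loss_before = torch.nn.functional.mse_loss(ddp_model(x), y)
for _ in range(20):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(ddp_model(x), y)
    loss.backward()                     # DDP hooks synchronize gradients here
    optimizer.step()
loss_after = torch.nn.functional.mse_loss(ddp_model(x), y)

dist.destroy_process_group()
```

Unlike `torch.nn.DataParallel`, which replicates the model inside one process each forward pass, DDP runs one process per GPU and only communicates gradients, which is why it scales better and is the prerequisite for FSDP's parameter sharding.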
2. Motivation
The current multi-GPU training in the PyPOTS framework, implemented with `torch.nn.DataParallel`, is not sufficient for training big models like Time-LLM (e.g. #675, where Time-LLM easily runs out of memory even on short-length TS samples). We need a more advanced mechanism like DDP or FSDP.
3. Your contribution
I'd like to lead or coordinate the development of this feature. Please leave comments below to start a discussion if you're interested; more comments will help prioritize this feature.