Vim
Has anyone tried using FSDP (Fully Sharded Data Parallel) with Vim?
Has anyone tried implementing FSDP for Vim? It would help train larger Vim models on larger datasets, since FSDP shards the model parameters across nodes/GPUs in addition to splitting the data, whereas DDP replicates the full model on every GPU. I am aware that FSDP is mainly optimized for Transformers, so I was wondering whether anyone has an implementation that works with Vim or knows of one. Thanks!
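To give a rough sense of why sharding matters here, below is a back-of-the-envelope sketch of per-GPU memory for model state under DDP versus FSDP. Everything in it is an assumption for illustration: the function name `per_gpu_state_bytes` is made up, the 1B parameter count is hypothetical and not a real Vim configuration, and the 16 bytes per parameter assumes fp32 weights, fp32 gradients, and two fp32 Adam moments. It ignores activations, buffers, and the temporary full parameters FSDP gathers during forward and backward, so real usage will be higher.

```python
def per_gpu_state_bytes(num_params, world_size, sharded, bytes_per_param=16):
    """Rough per-GPU bytes for parameters, gradients, and optimizer state.

    bytes_per_param=16 is an assumption: fp32 weights (4) + fp32 grads (4)
    + two fp32 Adam moments (8). Activations are not counted.
    """
    total = num_params * bytes_per_param
    # DDP keeps a full replica on every GPU; full sharding splits it evenly.
    return total / world_size if sharded else total


num_params = 1_000_000_000  # hypothetical model size, not a real Vim config
world_size = 8              # hypothetical number of GPUs

ddp = per_gpu_state_bytes(num_params, world_size, sharded=False)
fsdp = per_gpu_state_bytes(num_params, world_size, sharded=True)

print(f"DDP:  {ddp / 1e9:.1f} GB of model state per GPU")   # 16.0 GB
print(f"FSDP: {fsdp / 1e9:.1f} GB of model state per GPU")  # 2.0 GB
```

Under these assumptions the model state per GPU drops by a factor equal to the number of GPUs, which is the headroom I am hoping to use for larger Vim models.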