Mehdi Cherti

Results: 51 comments by Mehdi Cherti

Update: following this thread https://github.com/huggingface/accelerate/issues/807, full/partial locking now works. I am currently getting some throughput numbers with `mt5-xxl-ViT-G-14`
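For context, locking a tower in this setup amounts to freezing its parameters before the model is wrapped with FSDP. Below is a minimal sketch of the idea; the `lock_text_tower` helper and the `blocks` attribute are hypothetical names used for illustration, not the actual open_clip API.

```python
import torch.nn as nn

def lock_text_tower(text_tower: nn.Module, unlocked_layers: int = 0) -> None:
    # Freeze every parameter of the tower...
    for param in text_tower.parameters():
        param.requires_grad = False
    # ...then optionally unfreeze the last few transformer blocks
    # (`blocks` is assumed to be a ModuleList of transformer layers).
    if unlocked_layers > 0:
        for block in list(text_tower.blocks)[-unlocked_layers:]:
            for param in block.parameters():
                param.requires_grad = True
```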

Update: I mentioned earlier that training was hanging with a large number of nodes (e.g., 256 on JUWELS Booster). After checking with a lower number of nodes, it seems that the starting-up phase (before...

Hey @nkflash, thanks. I actually noticed that as well, even with smaller models; I am on it. EDIT: found a fix, will push soon

@nkflash I pushed the fix, could you please try again? I can confirm that it worked for me

Thanks @orchidmajumder, `use_orig_params` is working as expected. So with the PyTorch nightly, we can already use it. If we want to also support the current PyTorch stable version (1.13), wrapping layer...
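As a rough sketch of what `use_orig_params` looks like in practice with the FSDP API (the auto-wrap policy and the `block_cls` argument below are illustrative assumptions, not the exact open_clip integration):

```python
import functools
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

def wrap_with_fsdp(model: nn.Module, block_cls: type) -> FSDP:
    # Assumes torch.distributed has already been initialized.
    auto_wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={block_cls},
    )
    return FSDP(
        model,
        auto_wrap_policy=auto_wrap_policy,
        # use_orig_params keeps the original (named) parameters visible,
        # which matters e.g. for per-parameter weight-decay groups;
        # it needs a recent nightly / PyTorch >= 2.0.
        use_orig_params=True,
    )
```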

Yes, I was thinking of that as well, but I saw that there is already `'logit_scale' in n` in the exclude condition
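For reference, that exclusion is the kind of predicate used when building optimizer parameter groups, so that `logit_scale` (along with norms and biases) gets no weight decay. A minimal sketch; the exact names in open_clip's training script may differ:

```python
import torch.nn as nn

def param_groups(model: nn.Module, weight_decay: float = 0.2):
    # Parameters matching the exclude predicate get zero weight decay.
    exclude = lambda n, p: (
        p.ndim < 2 or "bn" in n or "ln" in n or "bias" in n or "logit_scale" in n
    )
    named = [(n, p) for n, p in model.named_parameters() if p.requires_grad]
    no_decay = [p for n, p in named if exclude(n, p)]
    decay = [p for n, p in named if not exclude(n, p)]
    return [
        {"params": no_decay, "weight_decay": 0.0},
        {"params": decay, "weight_decay": weight_decay},
    ]
```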

@rwightman Thanks for the suggestion, I moved the code a bit earlier; it is fixed now.

Update: @rwightman @rom1504 @mitchellnw @gabrielilharco @JeniaJitsev just for info, regarding the starting-up phase I mentioned earlier (https://github.com/mlfoundations/open_clip/pull/358#issuecomment-1423851399), I found out that it is not only proportional to the number of nodes...

Update: as the problem with a large number of nodes is solved, the following are updated scaling plots up to 1024 GPUs: G-14: ![G14](https://user-images.githubusercontent.com/509507/223187324-27444863-cf96-41fd-b9de-15fb8c4dbdf3.jpg) I also tested freezing a subset of layers, with MT5-XXL...

Update: the first fully trained model with FSDP is finished. I started with a ViT-B/32 on LAION-400M, 32 epochs (96 GPUs, local batch size of 896, global batch size of 86016,...
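The global batch size is just the per-GPU batch size times the number of GPUs; a quick check of the figures above:

```python
num_gpus = 96
local_batch_size = 896
global_batch_size = num_gpus * local_batch_size
assert global_batch_size == 86016  # matches the reported global batch size
```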