
OOM with batch size 1 with ViT-bigG on 40GB GPU

Open mitchellnw opened this issue 2 years ago • 4 comments

Similar to https://github.com/mlfoundations/open_clip/issues/261, I'm getting OOM with batch size 1 on a 40GB GPU with ViT-G.

mitchellnw avatar Dec 15 '22 04:12 mitchellnw

Weird. I once tested ViT-g-14 on an RTX 3090 (10G) and it worked; you could refer to this. Maybe you could try multiple machines.

OrangeSodahub avatar Dec 15 '22 04:12 OrangeSodahub

Sorry, I meant bigG, not g.

mitchellnw avatar Dec 15 '22 04:12 mitchellnw

Sorry for the misunderstanding.

OrangeSodahub avatar Dec 15 '22 04:12 OrangeSodahub

I think we've got two 'easy' options right now: DeepSpeed ZeRO (the PR for this, #264, might be worth testing) or PyTorch-native FSDP. I was talking w/ someone close to TPUs & PyTorch XLA recently, and they were strongly recommending giving FSDP a try for large-scale runs (there's both an XLA-specific variant and the normal PyTorch one).
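
For reference, a minimal sketch of what the FSDP route could look like, assuming a recent PyTorch (>= 1.12) and a process group launched via torchrun; sharding at open_clip's `ResidualAttentionBlock` is an assumption about the right wrap granularity:

```python
import functools

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy

import open_clip
from open_clip.transformer import ResidualAttentionBlock  # assumed wrap unit

# Launched with torchrun, so rank / world size come from the environment.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model, _, preprocess = open_clip.create_model_and_transforms("ViT-bigG-14")

# Shard at the transformer-block level so each rank only materializes one
# block's full parameters at a time during forward/backward.
wrap_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={ResidualAttentionBlock},
)

# Parameters, gradients, and optimizer state get sharded across ranks, so
# per-GPU memory scales roughly with model_size / world_size.
model = FSDP(model.cuda(), auto_wrap_policy=wrap_policy)
```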

Going full tensor parallelism is more work, and I feel things are about to change w/ upcoming native PyTorch features (compilation w/ annotations for parallelism) such that needing to do it Megatron-style will be a thing of the past.

rwightman avatar Dec 15 '22 07:12 rwightman

Seems like progress is being made with FSDP, and we also now think the OOM was because of model size + activations.
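
For what it's worth, a rough back-of-envelope consistent with that explanation; the ~2.5B parameter count for ViT-bigG-14 and plain fp32 AdamW state are assumptions:

```python
# Static training-state memory for a ~2.5B-parameter model with fp32 AdamW:
# fp32 weights (4 B) + fp32 grads (4 B) + Adam moments m and v (8 B) per param.
params = 2.5e9                 # approximate ViT-bigG-14 parameter count
bytes_per_param = 4 + 4 + 8
static_gib = params * bytes_per_param / 2**30
print(f"params + grads + optimizer state: ~{static_gib:.0f} GiB")
# -> ~37 GiB before a single activation, so a 40GB card can OOM at batch size 1.
```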

mitchellnw avatar Jan 09 '23 21:01 mitchellnw