StableCascade
stage_c_3b_finetuning
What is the max batch size on an A100 with 80 GB of VRAM?
With a batch size of 1, it peaks at 75457 MiB of VRAM according to nvidia-smi, on an A100 with 80 GB of VRAM.
Same problem here. I trained a ControlNet with batch size 1 and got a CUDA out-of-memory error with 80 GB of VRAM.
CUDA memory usage
Config:
lr: 1.0e-4
batch_size: 1
image_size: 768
multi_aspect_ratio: [1/1, 1/2, 1/3, 2/3, 3/4, 1/5, 2/5, 3/5, 4/5, 1/6, 5/6, 9/16]
grad_accum_steps: 1
updates: 100000
backup_every: 20000
save_every: 2000
warmup_updates: 1
use_fsdp: False
adaptive_loss_weight: True
I inserted torch.cuda.memory_allocated() prints to track where the memory goes:
print('1-load models start:',torch.cuda.memory_allocated())
models = self.setup_models(extras)
print('2-load models end:',torch.cuda.memory_allocated())
1-load models start: 0
2-load models end: 18517808640
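(Aside: nvidia-smi reports reserved memory plus the CUDA context, so it will always read higher than torch.cuda.memory_allocated(). For anyone repeating this, a small helper, my own and not from the repo, makes the prints easier to compare:)

```python
import torch

def log_mem(tag: str) -> None:
    # Report both allocated and reserved CUDA memory in GiB for a given checkpoint label.
    alloc = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"{tag}: allocated={alloc:.2f} GiB, reserved={reserved:.2f} GiB")
```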
print('3-optimizers start:',torch.cuda.memory_allocated())
optimizers = self.setup_optimizers(extras, models)
print('4-optimizers end:',torch.cuda.memory_allocated())
3-optimizers start: 18517808640
4-optimizers end: 47230638592
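The ~28.7 GB jump when the optimizers are created is consistent with plain AdamW keeping two fp32 state tensors (exp_avg and exp_avg_sq) per parameter. A back-of-the-envelope check, assuming a ~3.6B-parameter Stage C model (the parameter count is my assumption, not read from the repo):

```python
params = 3.6e9                       # approximate Stage C "3B" parameter count (assumed)
adamw_state_bytes = params * 2 * 4   # exp_avg + exp_avg_sq, each stored in fp32 (4 bytes)
print(f"{adamw_state_bytes / 1e9:.1f} GB")  # ~28.8 GB, matching the observed jump
```

An 8-bit optimizer (e.g. bitsandbytes AdamW8bit) would cut that to roughly a quarter, but that is a workaround, not something the repo does by default.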
conditions = self.get_conditions(batch, models, extras)
print('11-conditions:', torch.cuda.memory_allocated())
latents = self.encode_latents(batch, models, extras)
print('12-encode-latents:',torch.cuda.memory_allocated())
11-conditions: 47248081920
12-encode-latents: 47248118784
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    pred = models.generator(noised, noise_cond, **conditions)
print("13-models-generator:", torch.cuda.memory_allocated())
13-models-generator: 60122718720
loss, loss_adjusted = self.forward_pass(data, extras, models)
print("14-forward_pass:",torch.cuda.memory_allocated())
# BACKWARD PASS
grad_norm = self.backward_pass(
    i % self.config.grad_accum_steps == 0 or i == max_iters, loss, loss_adjusted,
    models, optimizers, schedulers
)
print("15-backward_pass:", torch.cuda.memory_allocated())
14-forward_pass: 59979117056
15-backward_pass: 47247679488
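So the model weights account for ~17 GiB, the fp32 AdamW state for another ~27 GiB, and the bf16 forward pass adds roughly 13 GB of activations, which are released again after backward. The config above has no flag for activation checkpointing, but a generic gradient-checkpointing pattern along the lines of the sketch below trades extra compute for exactly that activation memory (this is plain PyTorch, not the repo's trainer):

```python
import torch
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(blocks, x):
    # Run each block under activation checkpointing so intermediates are
    # recomputed during backward rather than kept alive for the whole forward pass.
    for block in blocks:
        x = checkpoint(block, x, use_reentrant=False)
    return x
```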
Truly wild to max out at a batch size of 1 on 80 GB of VRAM; something is definitely wrong here. It's a shame, too, since someone clearly made an effort to document the repo and make it usable.
Perhaps they assume you're using multiple GPUs and FSDP when finetuning the big models?
From the docs: "Furthermore, since distributed training is essential when training large models from scratch or doing large finetunes, we have an option to use PyTorch's Fully Sharded Data Parallel (FSDP). You can use it by setting use_fsdp: True. Note that you will need multiple GPUs for FSDP. However, as mentioned above, this is only needed for large runs. You can still train and finetune our largest models on a powerful single machine."
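For reference, the bare-bones PyTorch FSDP pattern the docs are alluding to looks like this (generic PyTorch, not the repo's trainer; the tiny nn.Sequential is just a stand-in for the Stage C generator, and the script name in the launch command is made up):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Launch with one process per GPU, e.g.: torchrun --nproc_per_node=2 train_fsdp.py
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
model = FSDP(model)  # shards parameters, gradients and optimizer state across ranks
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```

With the fp32 AdamW state sharded across ranks, the ~27 GiB optimizer overhead is split across GPUs, which is exactly the part that makes single-GPU finetuning so tight.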
Update: I just tried FSDP with two A100s to see if that would help. Now all that happens is my CPU works very hard, and that's it. I think this repo makes a lot of assumptions about your setup: FSDP, multi-GPU, Slurm, etc. https://github.com/Stability-AI/StableCascade/issues/71#issuecomment-1974039472