
stage_c_3b_finetuning

Open dushwe opened this issue 1 year ago • 3 comments

What is the max batch size on an A100 with 80 GB of VRAM?

With a batch size of 1, it seems to peak at 75457 MiB of VRAM according to nvidia-smi on an A100 with 80 GB of VRAM.
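A note for anyone profiling this: nvidia-smi includes the CUDA context and everything PyTorch's caching allocator has reserved, so it will read higher than torch.cuda.memory_allocated(). A generic PyTorch sketch (not part of this repo) for capturing the peak from inside the training loop:

```python
# Generic PyTorch sketch (not StableCascade code): capture peak GPU memory
# from inside the training loop instead of eyeballing nvidia-smi.
import torch

torch.cuda.reset_peak_memory_stats()

# ... run one training iteration here ...

print(f"allocated now : {torch.cuda.memory_allocated() / 2**20:,.0f} MiB")      # live tensors
print(f"reserved now  : {torch.cuda.memory_reserved() / 2**20:,.0f} MiB")       # caching allocator; closer to nvidia-smi
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 2**20:,.0f} MiB")  # high-water mark since reset
```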

dushwe avatar Feb 22 '24 03:02 dushwe

Same problem here. I train a ControlNet with a batch size of 1 and get a CUDA out-of-memory error with 80 GB of VRAM.

universewill avatar Feb 22 '24 04:02 universewill

CUDA memory usage

config set

lr: 1.0e-4
batch_size: 1
image_size: 768
multi_aspect_ratio: [1/1, 1/2, 1/3, 2/3, 3/4, 1/5, 2/5, 3/5, 4/5, 1/6, 5/6, 9/16]
grad_accum_steps: 1
updates: 100000
backup_every: 20000
save_every: 2000
warmup_updates: 1
use_fsdp: False
adaptive_loss_weight: True
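For reference, the effective batch size is batch_size × grad_accum_steps × number of GPUs, so this run optimizes with an effective batch of 1; raising grad_accum_steps is the usual way to grow it when per-step memory is the limit. A trivial check (single GPU assumed):

```python
# Effective batch size for the config above (assuming a single GPU).
batch_size = 1
grad_accum_steps = 1
num_gpus = 1
print(batch_size * grad_accum_steps * num_gpus)  # 1
```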

I inserted torch.cuda.memory_allocated() prints at each stage:

print('1-load models start:',torch.cuda.memory_allocated())
models = self.setup_models(extras)
print('2-load models end:',torch.cuda.memory_allocated())

1-load models start: 0 2-load models end: 18517808640

print('3-optimizers start:',torch.cuda.memory_allocated())
optimizers = self.setup_optimizers(extras, models)
print('4-optimizers end:',torch.cuda.memory_allocated())

3-optimizers start: 18517808640 4-optimizers end: 47230638592
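That ~28.7 GB jump during optimizer setup matches what fp32 Adam-style state (two moment tensors per parameter) would need for the ~3.6B-parameter Stage C generator; treating the optimizer as AdamW here is an assumption, not something the logs confirm. A back-of-the-envelope check:

```python
# Assumption: an AdamW-style optimizer holding two fp32 moment tensors per parameter.
params = 3.6e9                        # approx. Stage C 3.6B generator
predicted = params * 2 * 4            # exp_avg + exp_avg_sq, 4 bytes each
observed = 47230638592 - 18517808640  # delta between the two prints above

print(f"predicted optimizer state: {predicted / 1e9:.1f} GB")  # ~28.8 GB
print(f"observed increase:         {observed / 1e9:.1f} GB")   # ~28.7 GB
```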

conditions = self.get_conditions(batch, models, extras)
print('11-conditions:',torch.cuda.memory_allocated())
latents = self.encode_latents(batch, models, extras)
print('12-encode-latents:',torch.cuda.memory_allocated())

11-conditions: 47248081920 12-encode-latents: 47248118784

with torch.cuda.amp.autocast(dtype=torch.bfloat16):
            pred = models.generator(noised, noise_cond, **conditions)
            print("13-models-generator:",torch.cuda.memory_allocated())

13-models-generator: 60122718720


            loss, loss_adjusted = self.forward_pass(data, extras, models)
            print("14-forward_pass:",torch.cuda.memory_allocated())

            # BACKWARD PASS
            grad_norm = self.backward_pass(
                i % self.config.grad_accum_steps == 0 or i == max_iters, loss, loss_adjusted,
                models, optimizers, schedulers
            )
            print("15-backward_pass:",torch.cuda.memory_allocated())

14-forward_pass: 59979117056 15-backward_pass: 47247679488
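The forward through the generator holds roughly 12.9 GB of activations (step 12 → 13), and memory_allocated() falls back to ~47.2 GB once the backward pass has consumed the autograd graph. If single-GPU memory is the bottleneck, activation checkpointing is the generic PyTorch lever for that part of the budget; the toy sketch below only illustrates the technique and is not this repo's trainer:

```python
# Generic activation-checkpointing sketch (NOT StableCascade's trainer):
# activations inside the checkpointed block are recomputed during backward
# instead of being stored, shrinking the forward-activation footprint.
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(2048, 2048),
    torch.nn.GELU(),
    torch.nn.Linear(2048, 2048),
).cuda()

x = torch.randn(16, 2048, device="cuda", requires_grad=True)

with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    y = checkpoint(block, x, use_reentrant=False)

y.float().pow(2).mean().backward()
```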

dushwe avatar Feb 22 '24 05:02 dushwe

Truly wild to max out at a batch size of 1 on 80 GB of VRAM; something is definitely wrong here. Sad, too, since it seemed like someone made an effort to document the repo and make it usable.

Perhaps they assume you're using multiple GPUs and FSDP if you're finetuning the big models?

Furthermore, since distributed training is essential when training large models from scratch or doing large finetunes, we have an option to use PyTorch's Fully Sharded Data Parallel (FSDP). You can use it by setting use_fsdp: True. Note that you will need multiple GPUs for FSDP. However, as mentioned above, this is only needed for large runs. You can still train and finetune our largest models on a powerful single machine.
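For context, here is what a bare-bones FSDP setup looks like in plain PyTorch; this is a hedged sketch of the general mechanism (launched with torchrun across 2 GPUs), not the repo's actual trainer or launch script:

```python
# Minimal generic FSDP sketch (not StableCascade code).
# Launch with: torchrun --nproc_per_node=2 fsdp_sketch.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank())

    model = torch.nn.Sequential(
        torch.nn.Linear(4096, 4096),
        torch.nn.GELU(),
        torch.nn.Linear(4096, 4096),
    ).cuda()

    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # so the per-GPU footprint of those buffers scales roughly with 1/world_size.
    model = FSDP(model)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(8, 4096, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Note that the sketch assumes a launcher (torchrun, or srun under Slurm) that sets the usual RANK/WORLD_SIZE/MASTER_ADDR environment variables.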

Update: I just tried FSDP with 2 A100s to see if that would help. Now all that happens is my CPU works very hard, and that's it. I think this repo makes a lot of assumptions about your setup: FSDP, multi-GPU, Slurm, etc. https://github.com/Stability-AI/StableCascade/issues/71#issuecomment-1974039472

heyalexchoi avatar Mar 01 '24 21:03 heyalexchoi