
GPU load is severely unbalanced

herbiezhao opened this issue 3 years ago • 4 comments

I have 8 2080ti GPUs (11 GB each). When I train, I can only use 4 cards, and the batch size can only be set to 5. GPU 0 occupies 10481 MB, while each of the other three cards occupies 6906 MB. I don't know how to solve this, and I am not very familiar with PyTorch Lightning. When I use DDP for parallel training in plain PyTorch, the load is balanced.

herbiezhao avatar Dec 10 '21 09:12 herbiezhao

Hi! Sorry for the late reply!

You can try turning off validation: the pipeline will then use less GPU memory and the load will be a bit more balanced. However, as far as I know, this is a common issue with DDP: it always occupies more memory on GPU #0 than on the others. If you want equal memory consumption across cards, try Horovod; it should be fairly easy to set up with pytorch-lightning (however, I did not try it myself).
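For reference, here is a minimal sketch of what the trainer config could look like with Horovod. Two assumptions on my side: the kwargs block in configs/trainer/*.yaml is forwarded directly to pytorch_lightning.Trainer, and your Lightning version still accepts the distributed_backend argument (newer releases call it strategy):

kwargs:
  gpus: 1                        # Horovod runs one training process per GPU
  distributed_backend: horovod   # assumption: replaces the default ddp backend used here

Training would then be launched through horovodrun, e.g. horovodrun -np 8 python3 train.py ...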

windj007 avatar Jan 19 '22 10:01 windj007

This link might also be useful.

windj007 avatar Jan 19 '22 11:01 windj007

Regarding "You can try to turn off the validation": how do we turn validation off? Should we update the .yaml?

jxbs avatar Mar 18 '22 01:03 jxbs

Just set a very long validation period, either by adjusting the .yaml or by adding a command line argument to train.py, e.g. trainer.kwargs.check_val_every_n_epoch=10000
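If you prefer editing the .yaml, the sketch below shows the idea (assuming the kwargs section of your trainer config in configs/trainer/*.yaml is passed straight to pytorch_lightning.Trainer):

kwargs:
  check_val_every_n_epoch: 10000   # validation effectively never runs for typical training lengths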

However, when turning validation off, you also have to adjust the parameters for saving checkpoints (see configs/trainer/*.yaml).

Instead of this:

checkpoint_kwargs:
  verbose: True
  save_top_k: 5
  save_last: True
  period: 1
  monitor: val_lpips_fid100_f1_total_mean
  mode: max

Write this:

checkpoint_kwargs:
  verbose: True
  period: 1

As a result, you'll have a model dumped after each epoch, so you'll have to score the checkpoints afterwards and choose the best one.

windj007 avatar Mar 18 '22 08:03 windj007