GPU load is severely unbalanced
I have eight 2080 Ti GPUs (11 GB each). When I train, I can only use 4 cards, and the batch size can only be set to 5. GPU 0 occupies 10481 MB, while each of the other three cards occupies 6906 MB. I don't know how to solve this. I am not very familiar with pytorch-lightning. When DDP is used for parallel training in plain pytorch, the load is balanced.
Hi! Sorry for the late reply!
You can try turning off validation - then the pipeline will use less GPU memory and the load will be a bit more balanced. However, as far as I know, this is a common issue with DDP - it always occupies more memory on GPU #0 than on the others. If you want equal memory consumption, try horovod - it should be pretty easy to set up with pytorch-lightning (although I have not tried it myself).
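In case it helps, a very rough sketch of a horovod launch follows; it assumes horovod is installed with pytorch support and that the pytorch-lightning backend in configs/trainer/*.yaml has been switched to horovod with one GPU per process - none of this has been verified with this repo, and the config name and dataset location are placeholders:

# Untested sketch: one process per GPU via horovod instead of DDP.
# Requires the trainer config to use the horovod backend and gpus: 1.
horovodrun -np 8 python3 bin/train.py -cn lama-fourier location=my_dataset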
This link might also be useful.
"You can try to turn off the validation" - how do I turn off validation? Should we update the .yaml?
Just set a very long validation period by adjusting the .yaml, or by adding a command line argument to train.py, e.g. trainer.kwargs.check_val_every_n_epoch=10000
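For example, the full training command could look roughly like this (the config name lama-fourier and location=my_dataset are placeholders for your own config and dataset location):

# Sketch: pass the validation-period override on the command line.
python3 bin/train.py -cn lama-fourier location=my_dataset data.batch_size=5 \
    trainer.kwargs.check_val_every_n_epoch=10000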
However, when turning validation off, you also have to adjust the checkpoint-saving parameters (see configs/trainer/*.yaml).
Instead of this:

checkpoint_kwargs:
  verbose: True
  save_top_k: 5
  save_last: True
  period: 1
  monitor: val_lpips_fid100_f1_total_mean
  mode: max
Write this:

checkpoint_kwargs:
  verbose: True
  period: 1
As a result, you'll have a model dumped after each epoch - so you'll have to score them afterwards and choose the best one.
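If it helps, here is a rough sketch of how that scoring step could look. The experiment path, the checkpoint folder layout, the validation image folder, and the override names accepted by bin/predict.py (model.path, model.checkpoint, indir, outdir) are assumptions based on the default prediction config, so please check them against your setup:

# Sketch (untested): run inference with every dumped checkpoint,
# writing each result to its own folder so the outputs can be scored later.
EXPERIMENT=$(pwd)/experiments/my_training_run   # placeholder path to the run
for ckpt in "$EXPERIMENT"/models/*.ckpt; do
    name=$(basename "$ckpt" .ckpt)
    python3 bin/predict.py \
        model.path="$EXPERIMENT" \
        model.checkpoint="$(basename "$ckpt")" \
        indir=$(pwd)/val_images \
        outdir=$(pwd)/predicts/"$name"
done
# Afterwards score each predicts/<name> folder (e.g. with bin/evaluate_predicts.py)
# and keep the checkpoint with the best metric.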