pytorch-optimizer
Apollo optimizer eats all the GPU memory
My network is a few dense layers (conv with padding, concatenating the output to the input), a 2-layer LSTM, and 2 Linear layers at the end. Even after I made the network laughably small, all the GPU memory (8 GB) was consumed within a few epochs.
I understand that the Apollo optimizer is quasi-Newton and attempts to approximate the second derivative, but still: why does memory consumption grow with every epoch?
I tried calling torch.cuda.empty_cache(), torch.clear_autocast_cache() (I don't really understand this one, but who knows), and gc.collect(). After each call consumption dropped a bit, but not as fast as Apollo was taking it :)
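For anyone trying to confirm the same growth, here is a minimal sketch of a per-epoch memory log. The names make_memory_tracker and grows_every_epoch are made up for illustration and are not part of pytorch-optimizer or PyTorch. The readings below are stand-in numbers so the snippet runs without a GPU; in a real run you would pass torch.cuda.memory_allocated as the reader, since it reports memory held by live tensors.

```python
from typing import Callable, List, Tuple


def make_memory_tracker(
    read_bytes: Callable[[], int],
) -> Tuple[Callable[[], int], List[int]]:
    """Return a `record` function and the list it appends readings to."""
    history: List[int] = []

    def record() -> int:
        # Take one reading (in bytes) and remember it.
        history.append(read_bytes())
        return history[-1]

    return record, history


def grows_every_epoch(history: List[int], min_epochs: int = 3) -> bool:
    """True if there are at least `min_epochs` readings and each exceeds the previous one."""
    if len(history) < min_epochs:
        return False
    return all(later > earlier for earlier, later in zip(history, history[1:]))


# Stand-in readings (hypothetical values) so the sketch runs without a GPU.
# In a real run, pass torch.cuda.memory_allocated instead of the lambda and
# call record() at the end of each training epoch.
fake_readings = iter([100, 180, 260, 340])
record, history = make_memory_tracker(lambda: next(fake_readings))
for epoch in range(4):
    record()

print(history, grows_every_epoch(history))
```

If the logged values keep climbing, the memory is still referenced by live tensors. That would also fit the small effect of torch.cuda.empty_cache(), which only releases cached blocks that no tensor is using.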
I ran into this problem when I had set weight_decay > 0. Once I removed it, memory usage stayed constant.
Same here