[Feature request] Gradient accumulation in CrossEntropyRewardTrainer in preference_comparisons.py
🚀 Feature
When the reward function is trained by the `CrossEntropyRewardTrainer` in `preference_comparisons.py`, it currently takes a batch of trajectory fragments and preferences, calculates its reward for each pair of fragments, and then at the end of the batch calculates the cross-entropy loss and computes a gradient. Instead, when some flag like `accumulate_gradients` is passed to the reward trainer, the loss and gradient could be computed for each pair of fragments in the batch independently.
Motivation
This would significantly increase the batch size that is compatible with any given amount of GPU memory. When training a CNN reward net on an unwrapped Atari environment with a default batch size of 30, each batch uses 30 fragment pairs * 2 fragments / pair * 100 transitions / fragment * (3 observation channels + 14 action channels + 64 channels in hidden layers) * 210 height * 160 width * 4 bytes / float32 = 65.3 GB of memory per batch. If gradients are accumulated per fragment pair, then this gets divided by 30 for a more manageable 2.2 GB per fragment pair.
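For reference, that estimate as a quick back-of-the-envelope script (the constants are just the numbers from the paragraph above):

```python
# Back-of-the-envelope activation memory for one preference-comparison batch
# on unwrapped Atari, using the numbers from the estimate above.
fragment_pairs = 30             # default batch size
fragments_per_pair = 2
transitions_per_fragment = 100
channels = 3 + 14 + 64          # obs + action + hidden-layer channels
height, width = 210, 160        # unwrapped Atari frame size
bytes_per_float32 = 4

total = (fragment_pairs * fragments_per_pair * transitions_per_fragment
         * channels * height * width * bytes_per_float32)
print(f"{total / 1e9:.1f} GB per batch")              # ~65.3 GB
print(f"{total / fragment_pairs / 1e9:.1f} GB per pair")  # ~2.2 GB
```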
Pitch
Basically, compute losses on individual pairs of trajectory fragments, and call `loss.backward()` inside the loss-calculation loop when the reward net is being trained.
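As a rough illustration, the training step could look something like the sketch below. All names (`reward_net`, `preference_loss`, etc.) are placeholders rather than the actual `preference_comparisons.py` API; scaling each loss by the batch size keeps the accumulated gradient equal to the full-batch mean-loss gradient.

```python
def train_batch_accumulating(reward_net, optimizer, fragment_pairs,
                             preferences, preference_loss):
    """One gradient step over a batch, backpropagating per fragment pair.

    All argument names are illustrative, not imitation's actual API.
    """
    optimizer.zero_grad()
    for (frag_a, frag_b), pref in zip(fragment_pairs, preferences):
        # Forward pass on a single pair, so only one pair's activations
        # are alive on the GPU at a time.
        loss = preference_loss(reward_net, frag_a, frag_b, pref)
        # Scale so the accumulated gradient matches the full-batch mean loss.
        (loss / len(fragment_pairs)).backward()
    optimizer.step()
```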
Alternatives
Use smaller batches and adjust learning rates accordingly; this should often work OK, but it's nice to support larger batches.
Quick suggestion here. Rather than having a binary choice, either `accumulate_grad=True` or `accumulate_grad=False`, we could have a setting like PyTorch Lightning's `accumulate_grad_batches` (see here). Say your GPU memory can support a batch size of 16, but you want to emulate a batch size of 64. Then you set `accumulate_grad_batches = 4`. This will run `loss.backward()` after each step but only call `optim.step()` every 4 batches. This allows you to take full advantage of your hardware even when you can't fit the full batch size into memory.
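In training-loop terms, that behavior would be roughly the following (a generic sketch of the accumulation pattern, not Lightning's actual implementation; `model`, `compute_loss`, and `dataloader` are placeholders):

```python
accumulate_grad_batches = 4  # e.g. emulate batch size 64 with GPU batches of 16

optimizer.zero_grad()
for i, batch in enumerate(dataloader):
    loss = compute_loss(model, batch)  # illustrative loss function
    # Scale so the accumulated gradients match the mean over the large batch.
    (loss / accumulate_grad_batches).backward()
    if (i + 1) % accumulate_grad_batches == 0:
        optimizer.step()
        optimizer.zero_grad()
```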
Finding the correct name for `accumulate_grad_batches` in our code base might be a bit tricky, but it seems more general than only allowing the pairs to be processed all in one batch or one at a time.
Yeah I agree we should either specify the "inner" batch size to execute on GPU, or the number of accumulation batches. Computing each pair of trajectory fragments on the GPU separately would come with a considerable performance overhead.
I think I favor the former solution, as you can then just set the "inner" batch size to the maximum allowed by your GPU for the task (maximizing performance) and forget about it.
Agree with above, and think that specifying the inner batch size makes more sense.
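A sketch of what specifying the inner batch size might look like; again, all names are hypothetical, and `batch` stands in for one outer batch of fragment pairs plus preferences (anything sliceable with a length):

```python
def train_batch(reward_net, optimizer, batch, preference_loss,
                inner_batch_size=16):
    """One optimizer step per outer batch, forward/backward per inner chunk.

    Names are hypothetical, not imitation's actual API. Assumes
    preference_loss returns the mean loss over a chunk.
    """
    optimizer.zero_grad()
    for start in range(0, len(batch), inner_batch_size):
        chunk = batch[start:start + inner_batch_size]
        loss = preference_loss(reward_net, chunk)
        # Weight each chunk by its size so the accumulated gradient equals
        # the mean-loss gradient over the full outer batch.
        (loss * len(chunk) / len(batch)).backward()
    optimizer.step()
```

Setting `inner_batch_size=1` recovers the per-pair behavior from the original pitch, so this subsumes both proposals.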
I suspect this issue isn't specific to preference comparisons -- I think it'd also happen in AIRL/GAIL, for example. Really, anywhere we might want large batch sizes. Worth verifying the code/running tests to replicate the issue in other algorithms, and if so, adding gradient accumulation support to those as well, at least if it doesn't complicate the code too much.