open_clip
Add support for gradient accumulation.
Added a new flag --accum-freq (accumulation frequency), which defaults to 1. If this is greater than 1, then the optimizer is only stepped every --accum-freq batches.
Can be combined with gradient checkpointing.
The feature was requested for the case where people only have a few GPUs but want to train with a large batch size.
We don't have to merge this if people think it makes things too complicated, and can instead close it and point people to it upon request, but I'm at least curious to hear thoughts.
For a per-GPU batch size of m and --accum-freq k, the effective per-GPU batch size is mk.
The basic pseudocode, when --accum-freq > 1, is:
accum_data, accum_features = [], []
for i, data in enumerate(dataloader):
    optimizer.zero_grad()

    # first, get the features for a bunch of batches without gradient tracking
    with torch.no_grad():
        features = model(data)
    accum_data.append(data)
    accum_features.append(features)

    # keep caching until accum_freq batches have been collected
    if (i + 1) % accum_freq > 0:
        continue

    # now re-compute the forward pass for each cached batch, with gradient tracking,
    # splicing its with-grad features into the cached (no-grad) features of the others
    for j, data in enumerate(accum_data):
        features = model(data)
        all_features = torch.cat(accum_features[:j] + [features] + accum_features[j + 1:])
        loss = get_loss(all_features)
        loss.backward()

    optimizer.step()
    accum_data, accum_features = [], []
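For intuition, here is a minimal, self-contained sketch (not code from this PR; the toy Linear model and the contrastive_loss helper are made up for illustration) showing why the per-sub-batch backward passes above accumulate to the same gradient as a single large-batch backward, provided the recomputed forward pass is deterministic:

    # toy check: accumulation scheme vs. one big backward pass
    import torch

    torch.manual_seed(0)
    model = torch.nn.Linear(16, 8)
    data = torch.randn(12, 16)      # "full batch" of 12 samples
    accum_freq = 3                  # split into 3 sub-batches of 4

    def contrastive_loss(features):
        # a simple loss that couples every sample in the batch
        logits = features @ features.t()
        labels = torch.arange(features.shape[0])
        return torch.nn.functional.cross_entropy(logits, labels)

    # reference: one big forward/backward
    model.zero_grad()
    contrastive_loss(model(data)).backward()
    ref_grad = model.weight.grad.clone()

    # accumulation scheme from the pseudocode
    model.zero_grad()
    chunks = data.chunk(accum_freq)
    with torch.no_grad():
        accum_features = [model(c) for c in chunks]
    for j, c in enumerate(chunks):
        features = model(c)  # recompute with gradient tracking
        all_features = torch.cat(accum_features[:j] + [features] + accum_features[j + 1:])
        contrastive_loss(all_features).backward()

    print(torch.allclose(ref_grad, model.weight.grad, atol=1e-5))  # expect: True

Each inner backward only contributes the loss gradient flowing through that sub-batch's features; summed over sub-batches, that is exactly the full-batch gradient.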
Interesting! Does it work with --local-loss --gather-with-grad too?
@usuyama Yep! If you check out the pseudocode above, it doesn't really depend on how the loss is implemented.
Very nice!
For users, it could be good to have some guidance on how much training time overhead this adds.
Sounds good. Using --accum-freq k is just over k times slower than --accum-freq 1.
Hi, looks cool!
It's not obvious to me what this PR introduces in terms of CPU and memory overhead. What recomputation gets done, what temporary storage is used, and are there any network consequences? This could be answered either by analysing the code in detail or by running experiments at multiple scales.
Cool! Is this an implementation of GradAccum in BASIC?
Here is a screenshot verifying that training on 8 GPUs with a per-GPU batch size of 512 behaves the same as training on 4 GPUs with a per-GPU batch size of 512 and --accum-freq 2. However, it's 2x as slow. I've also updated the README to clarify samples/s and other information about this feature.

Cool! Is this an implementation of GradAccum in BASIC?
Not exactly but it looks like an overall similar approach.
Any thoughts on if this can be merged?
Yeah lgtm, let's go
Hello!
Thanks a lot for adding this functionality. I think there is an error in the computation of the number of samples during the logging process: it's missing the multiplication by the --accum-freq argument. I made a PR to correct it: #327