
Is it possible to use gradient accumulation to compensate for limited GPU memory?

ChrisJWest opened this issue 3 years ago · 3 comments

Hi! I had a quick question I was wondering if I could pick your brain about:

I'm using SimCLR on very high-dimensional data, which caps me at a batch size of 4, and SimCLR really isn't feasible with a batch that small. I was thinking about some form of gradient accumulation, but my concern is that it may not mesh well with how the loss function works. Say I want an effective batch size of 64 with a micro-batch size of 4. Since the loss is built from dot products between the projections, instead of contrasting each sample against the rest of a 64-pair batch as normal SimCLR would, accumulation would average the loss over 8 separate instances of 4 pairs and then apply the update. I'm not confident that has the same effect as a genuinely large batch, because the loss relies on comparing each positive sample against a large number of negatives.

Do you think there is a way I can modify this framework to simulate large batch sizes under these memory constraints, or a way to get gradient accumulation to work the way I want?
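
To make the concern concrete, here's a rough PyTorch-style sketch of the NT-Xent loss as I understand it (illustrative only, not the code from this repo):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent over N positive pairs (2N embeddings total). Each anchor is
    contrasted against the other 2N - 2 embeddings, so the pool of negatives
    shrinks with the (micro-)batch size."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d)
    sim = z @ z.t() / temperature                        # (2N, 2N) cosine similarities
    sim.fill_diagonal_(float("-inf"))                    # mask self-similarity
    n = z1.shape[0]
    # the positive for row i is row i + N (and vice versa)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
```

With micro-batches of 4 pairs, each anchor only sees 6 negatives in the softmax denominator, versus the 126 it would see in a true batch of 64 pairs, and averaging the 8 micro-batch losses doesn't recover those missing comparisons.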

ChrisJWest · Aug 11 '22, 22:08

This should be possible, though I haven't tried anything like that. See https://arxiv.org/pdf/2111.10050.pdf

chentingpc · Aug 13 '22, 00:08

Great paper, thanks for the link. Yes, I tried running some experiments, and standard gradient accumulation did not do particularly well with a micro-batch size this low (4). I do like the rematerialization ideas in the paper, though, so I might see if I can try something like that.
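
To make it concrete, here's the rough shape of the two-pass rematerialization idea I'm picturing (just a sketch, assuming the `nt_xent` helper from my first post and a PyTorch-style encoder; I haven't checked this against the paper's exact algorithm):

```python
import torch

def rematerialized_step(encoder, optimizer, micro_batches, temperature=0.5):
    """Two-pass trick: (1) embed every micro-batch without gradients and take
    the contrastive loss over the full effective batch, (2) re-run each
    micro-batch with autograd and backprop through the cached embedding
    gradients, so activation memory only scales with the micro-batch size."""
    # Pass 1: build the full set of projections with no graph attached.
    with torch.no_grad():
        z1 = torch.cat([encoder(x1) for x1, _ in micro_batches])
        z2 = torch.cat([encoder(x2) for _, x2 in micro_batches])
    z1.requires_grad_(True)
    z2.requires_grad_(True)
    loss = nt_xent(z1, z2, temperature)   # full-batch negatives, cheap to hold
    loss.backward()                       # gradients w.r.t. the embeddings only

    # Pass 2: rematerialize each micro-batch's forward pass and feed it the
    # cached embedding gradient, accumulating the correct parameter gradients.
    optimizer.zero_grad()
    offset = 0
    for x1, x2 in micro_batches:
        b = x1.shape[0]
        out1, out2 = encoder(x1), encoder(x2)
        # surrogate objective whose parameter gradient matches the full loss's
        (out1 * z1.grad[offset:offset + b]).sum().backward()
        (out2 * z2.grad[offset:offset + b]).sum().backward()
        offset += b
    optimizer.step()
    return loss.detach()
```

The obvious costs are a second forward pass per micro-batch, and things like batch norm still see micro-batch statistics rather than full-batch ones, so I'd treat this as an approximation to check empirically.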

ChrisJWest · Aug 15 '22, 19:08

Feel free to share your repo here if you get it to work eventually!

chentingpc · Aug 15 '22, 23:08