
Gradient accumulation support?


Describe the feature and the current behavior/state:

Gradient accumulation is extremely useful when working with large images or volumetric data, when training on low-end hardware, or when training across multiple GPUs. For me, the most important benefit is being able to use larger effective batch sizes without exhausting GPU memory.

Currently, there does not seem to be a straightforward way to use gradient accumulation in Keras.

What I have tried:

In TF1, we created a wrapper that could be applied to any optimizer and changed how and when the update happens. I have tried to implement such a method in TF2, greatly inspired by the attempt by other developers at TF-addons, such as @fsx950223 and @stefan-falk https://github.com/tensorflow/addons/pull/2525. However, I have not managed to get the expected behaviour (see here for some of the experiments I performed, and here for the optimizer wrapper implementation).
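For context, the core idea behind such an optimizer wrapper is simply to sum gradients over several steps and only let the wrapped optimizer apply an update every k-th step. Below is a minimal eager-mode sketch of that idea only; it is not the TF-addons wrapper, and the helper names (`make_accumulators`, `accumulate_then_apply`) are made up for illustration:

```python
import tensorflow as tf

def make_accumulators(variables):
    # One zero-initialised, non-trainable accumulator per trainable variable.
    return [tf.Variable(tf.zeros_like(v), trainable=False) for v in variables]

def accumulate_then_apply(optimizer, accumulators, grads, variables, step, k):
    # Add this step's gradients to the running sums.
    for acc, grad in zip(accumulators, grads):
        acc.assign_add(grad)
    # Every k-th step, apply the averaged gradients and reset the sums.
    if (step + 1) % k == 0:
        optimizer.apply_gradients(
            [(acc / k, var) for acc, var in zip(accumulators, variables)])
        for acc in accumulators:
            acc.assign(tf.zeros_like(acc))
```

Used from a custom training loop, one would compute the gradients with tf.GradientTape and call accumulate_then_apply once per batch; the difficulty described above comes from packaging this behaviour inside a Keras Optimizer subclass so that it works transparently with model.fit and distribution strategies.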

I therefore looked around for alternative solutions and found this suggestion on Stack Overflow. I have expanded upon the idea and it seems to be working. After some thorough debugging and benchmarking, I have made a simple solution available in this repo, so that at least one simple solution for gradient accumulation (GA) exists in TF2.

Proposed solution:

The idea is very simple: override the train_step method of tf.keras.Model and add gradient accumulation support there. I have produced a simple model wrapper that does this for you, which you should ideally be able to apply to any tf.keras.Model to enable gradient accumulation, like so:

```python
model = tf.keras.Model(...)
model = GAModelWrapper(n_gradients=k, inputs=model.input, outputs=model.output)
```
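For reference, here is a condensed sketch of what such a wrapper can look like internally, roughly following the Stack Overflow approach mentioned above. It is not necessarily identical to the implementation in the linked repo, and it assumes the model is compiled with a loss and built through the functional inputs/outputs constructor call shown above:

```python
import tensorflow as tf

class GAModelWrapper(tf.keras.Model):
    """Accumulates gradients over `n_gradients` batches before each optimizer update."""

    def __init__(self, n_gradients=1, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.n_gradients = tf.constant(n_gradients, dtype=tf.int32)
        self.n_accum_step = tf.Variable(0, dtype=tf.int32, trainable=False)
        # One accumulator per trainable variable (the model is already built,
        # since inputs/outputs are passed to the constructor above).
        self.gradient_accumulation = [
            tf.Variable(tf.zeros_like(v), trainable=False)
            for v in self.trainable_variables
        ]

    def train_step(self, data):
        self.n_accum_step.assign_add(1)
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred, regularization_losses=self.losses)

        # Accumulate this batch's gradients.
        gradients = tape.gradient(loss, self.trainable_variables)
        for i in range(len(self.gradient_accumulation)):
            self.gradient_accumulation[i].assign_add(gradients[i])

        # Every n_gradients steps, apply the accumulated gradients and reset.
        tf.cond(tf.equal(self.n_accum_step, self.n_gradients),
                self.apply_accumulated_gradients, lambda: None)

        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}

    def apply_accumulated_gradients(self):
        self.optimizer.apply_gradients(
            zip(self.gradient_accumulation, self.trainable_variables))
        self.n_accum_step.assign(0)
        for acc in self.gradient_accumulation:
            acc.assign(tf.zeros_like(acc))
```

Note that this sketch sums the accumulated gradients; depending on the loss reduction, dividing by n_gradients before apply_gradients may be preferable so the update matches a single large batch.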

There is definitely some work left to be done to make it handle all scenarios, but it seems to be working fine on the use cases I have tested so far.

So what remains?

Currently, I am unsure whether this is the best approach. Perhaps there is a better way of solving this. A challenge might be to get distributed training working with multiple GPUs. I believe that was the biggest obstacle with the optimizer wrapper solution.

Are there any devs working on adding gradient accumulation support in Keras?

Are you willing to contribute it: Yes

andreped, Jun 02 '22 04:06