Gradient accumulation support?
Describe the feature and the current behavior/state:
Gradient accumulation is extremely useful when working with large images/volumetric data, using low-end hardware, or training on multiple GPUs. For me, the most important feature is to be able to use larger batch sizes without exhausting memory.
Currently, there does not seem to be a straightforward way to use gradient accumulation in Keras.
What I have tried:
In TF1, we created a wrapper that could be used on any optimizer, changing how and when the update happens. I have tried to implement such a method in TF2, greatly inspired by the attempt by other developers at TF-addons, such as @fsx950223 and @stefan-falk https://github.com/tensorflow/addons/pull/2525. However, I have not managed to get the expected behaviour (see here for some of the experiments I performed, and here for the optimizer wrapper implementation).
I therefore looked around for alternative solutions and found this suggestion on Stack Overflow. I have expanded upon the idea and it seems to be working. After some thorough debugging and benchmarking, I have made a simple solution available in this repo, so that at least one simple solution for GA exists in TF2.
Proposed solution:
The idea is extremely simple: overload the train_step method of tf.keras.Model and add gradient accumulation support there. In the end, I have produced a simple model wrapper that does this for you, and which you should ideally be able to apply to any tf.keras.Model to enable gradient accumulation, like so:
model = tf.keras.Model(...)
model = GAModelWrapper(n_gradients=k, inputs=model.input, outputs=model.output)
There is definitely some work left to be done to make it handle all scenarios, but it seems to be working fine on the use cases I have tested so far.
So what remains?
Currently, I am unsure whether this is the best approach. Perhaps there is a better way of solving this. A challenge might be to get distributed training working with multiple GPUs. I believe that was the biggest obstacle with the optimizer wrapper solution.
Are there any devs working on adding gradient accumulation support in Keras?
Are you willing to contribute it: Yes
@andreped Thank you for reporting this issue! Could you please specify the use cases for this feature? Thank you!
Could you please specify the use cases for this feature?
What do you mean by "use cases"? Do you mean scenarios in which having a simple way to perform gradient accumulation would be beneficial?
Are you familiar with the concept of using gradient accumulation to "artificially" increase the batch size while holding memory usage fixed? Essentially, you split a batch into smaller micro-batches, calculate the gradients for each, and then average across them, without ever having the entire batch in memory. It is a generic concept for "approximating" batch training - that is the use case. Or am I misunderstanding something?
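To make it concrete, here is a rough sketch of how one would do this today with a custom training loop (just an illustration; the toy model, optimizer and accumulation factor below are placeholders):

import tensorflow as tf

# Placeholder model/optimizer/loss; any Keras model works the same way.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10),
])
optimizer = tf.keras.optimizers.Adam(1e-3)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
accum_steps = 4  # micro-batches per "virtual" batch

# One zero-initialised accumulator per trainable variable.
accumulators = [tf.Variable(tf.zeros_like(v), trainable=False)
                for v in model.trainable_variables]

@tf.function
def accumulate(x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    for acc, g in zip(accumulators, grads):
        acc.assign_add(g)
    return loss

def train(dataset):
    for step, (x, y) in enumerate(dataset):
        accumulate(x, y)
        if (step + 1) % accum_steps == 0:
            # MEAN reduction over the accumulated micro-batch gradients,
            # followed by a single optimizer update, then reset.
            optimizer.apply_gradients(
                [(acc / accum_steps, v)
                 for acc, v in zip(accumulators, model.trainable_variables)])
            for acc in accumulators:
                acc.assign(tf.zeros_like(acc))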
/cc @georgepaw if he is interested in the design as Graphcore has an API for this in: https://github.com/graphcore/tensorflow/blob/r2.5/sdk-release-2.5/tensorflow/python/ipu/optimizers/gradient_accumulation_optimizer.py
@andreped Thanks for reporting the issue!
Currently users would need to write their own custom training loop to handle the gradient accumulation, which is not too hard, so we have not yet made this an API. I would like to understand more here - do you see gradient accumulation widely used? If it's a popular feature, we will design an API for that.
@chenmoneygithub
Currently users would need to write their own custom training loop to handle the gradient accumulation, which is not too hard, so we have not yet made this an API.
Agree. It's doable to implement it in a custom training loop. But at the same time, it would be feasible to have an API to do this with model.fit(). Implementing a custom loop to have gradient accumulation is cumbersome (IMHO).
I would like to understand more here - do you see gradient accumulation widely used?
Yes, it's widely used when it's required. It's one of the techniques that enable larger-batch training with limited computational resources. FYI, it's mentioned in pytorch-lightning as one of the effective training techniques.
Currently users would need to write their own custom training loop to handle the gradient accumulation
Actually, you don't even need to write your own custom training loop in TF2 anymore. It is much easier to add support for it by overloading the train_step method. An example of how I did it can be seen here: https://github.com/andreped/GradientAccumulator/blob/main/GradientAccumulator/GAModelWrapper.py#L14
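For anyone who does not want to dig through that repo, the gist of the overloading approach looks roughly like the sketch below (a simplified illustration of the idea, not the actual GAModelWrapper code; it skips sample weights and mixed-precision handling):

import tensorflow as tf

class GAModel(tf.keras.Model):
    # Accumulate gradients over n_gradients micro-batches before applying
    # a single optimizer update.

    def __init__(self, n_gradients=4, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.n_gradients = tf.constant(n_gradients, dtype=tf.int32)
        self.n_accum_step = tf.Variable(0, dtype=tf.int32, trainable=False)
        self.gradient_accumulation = [
            tf.Variable(tf.zeros_like(v), trainable=False)
            for v in self.trainable_variables]

    def train_step(self, data):
        x, y = data
        self.n_accum_step.assign_add(1)

        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compiled_loss(y, y_pred,
                                      regularization_losses=self.losses)
        grads = tape.gradient(loss, self.trainable_variables)
        # MEAN reduction: divide each micro-batch gradient by n_gradients.
        for acc, g in zip(self.gradient_accumulation, grads):
            acc.assign_add(g / tf.cast(self.n_gradients, g.dtype))

        # Apply and reset only when a full "virtual" batch has been seen.
        tf.cond(tf.equal(self.n_accum_step, self.n_gradients),
                self._apply_accumulated_gradients, lambda: None)

        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}

    def _apply_accumulated_gradients(self):
        self.optimizer.apply_gradients(
            zip(self.gradient_accumulation, self.trainable_variables))
        for acc in self.gradient_accumulation:
            acc.assign(tf.zeros_like(acc))
        self.n_accum_step.assign(0)

Usage then mirrors the wrapper call shown earlier: GAModel(n_gradients=k, inputs=model.input, outputs=model.output), followed by a normal compile() and fit().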
However, it is definitely a very commonly used method and it would surely be a popular feature to add. Perhaps having an API to do just what I did there is a good idea? Not sure.
Note that it is important that this works in multi-GPU strategies, as that is one of its core usages. That is not something I have explored that much myself, but that is a popular use case for it.
Also note that if you introduce gradient accumulation naively, like I did above, then some layers will not be directly compatible. You will have suboptimal behaviour with BatchNormalization, for instance, as it will update for every single micro-batch rather than once the gradient accumulation is done for a given mini-batch. Has anyone made an attempt to fix BN for this use case? @innat @dathudeptrai
Hence, you might lose the effect of using gradient accumulation if you use BN in your model, which is an extremely popular layer, so it might be a good idea to solve that issue simultaneously.
The issue with BN in GA has been thoroughly discussed for PyTorch: https://forums.fast.ai/t/accumulating-gradients/33219/42
Attempts have been made, but as you can see it is not so easy to get it working properly: https://forums.fast.ai/t/accumulating-gradients/33219/62
Also note that it appears common to just SUM the gradients in gradient accumulation instead of doing a MEAN reduction. I think the latter makes more sense, but there might be situations where SUM reduction is more correct. Not sure. This is discussed in the same thread mentioned above.
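For intuition (my own summary, not from that thread): with accumulation factor k and micro-batch gradients g_1, ..., g_k, MEAN reduction gives (g_1 + ... + g_k) / k, which matches the gradient of the full accumulated batch when each micro-batch loss is already averaged over its samples; SUM gives k times that, which for plain SGD behaves like scaling the learning rate by k.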
Lastly, note this comment by @tomerk regarding how GA should be implemented in Keras (which might be a better idea than what I did, not sure): https://github.com/tensorflow/addons/issues/2260#issuecomment-747711685
Hope it helps!
Hey, I just wanted to throw in some personal experience with working on gradient accumulation in TF/Keras at Graphcore for IPUs.
- Batch Norm - for the MLPerf submission, distributed batch norm is used to calculate the statistics over a bigger batch. For example, if we have 64 replicas, each running with batch size 128, we could simulate batch size 256 for the statistics by exchanging stats between each pair of replicas.
- Accumulation method - we've implemented three methods (we often experiment with lower-precision formats, for example using fp16 for the gradient accumulation tensors): a) sum - might overflow in fp16; b) mean - feels the most natural, but if the gradients are normalised before accumulation they might underflow, and if they are normalised after accumulation they might overflow; c) running mean - we've found this to be more stable for lower-precision formats:
accumulated_gradient = zeros()
for step in range(gradient_accumulation_factor):
    micro_batch_gradients = ...
    accumulated_gradient = ((step / (step + 1)) * accumulated_gradient) + ((1 / (step + 1)) * micro_batch_gradients)
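As a quick sanity check (my own worked example, not from the comment above): for gradient_accumulation_factor = 3, the loop gives g1 after step 0, (g1 + g2)/2 after step 1, and (2/3)·(g1 + g2)/2 + (1/3)·g3 = (g1 + g2 + g3)/3 after step 2, i.e. the same result as the plain mean, but without ever holding a large sum in fp16.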
Thanks for tagging me! I'll take a look. For context: I've implemented a version that works with keras and model.fit() here: https://github.com/sokrypton/AccAdam_TF2
Thanks all for the great discussion!
@andreped Thanks for raising the BN issue, yes, it's something we should support. Actually I am curious about the performance loss if we don't handle the accumulation for the BN layer - in your experiments, was there a big performance loss caused by the suboptimal treatment of BN?
For how to handle accumulation in distributed training, I believe the current MirroredStrategy can handle both SUM and MEAN. I need to double-check with our distributed training experts on 1) whether GPU distributed training could support aggregating over sub-batches and also across devices, and 2) whether this is supported in TPUStrategy.
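For reference, both reductions are indeed exposed when combining per-replica values under MirroredStrategy; a minimal sketch (dummy per-replica tensors, just to show the two ReduceOps):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

def per_replica_gradient():
    # Dummy per-replica "gradient", different on each replica.
    replica_id = tf.distribute.get_replica_context().replica_id_in_sync_group
    return tf.ones((2,)) * tf.cast(replica_id + 1, tf.float32)

per_replica = strategy.run(per_replica_gradient)
grad_sum = strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica, axis=None)
grad_mean = strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica, axis=None)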
Actually I am curious about the performance loss if we don't handle the accumulation for the BN layer
@chenmoneygithub I have not performed a rigorous test to benchmark with/without BN under GA, but intuition tells us that BN would update too frequently and introduce noise into training. I have also observed this myself, where the benefit of the increased batch size through GA was lost when BN was used in the model. But that is surely task and data dependent.
It has been suggested to change the default momentum hyperparameter based on the number of accumulations. Essentially, by reducing momentum, the too-frequent updates of BN would introduce less noise, and should therefore result in smoother behaviour. However, I have not seen an actual benchmark on this topic before, nor am I aware of a best practice. Perhaps anyone else knows?
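To make that concrete, a rough sketch of the momentum workaround (the adjustment rule here is my own heuristic, not an established recipe; note that Keras defines momentum as the moving-average decay, so making each micro-batch update count less means moving momentum closer to 1 - the opposite direction of PyTorch's convention, where one would indeed reduce it):

import tensorflow as tf

accum_steps = 4        # micro-batches per accumulated batch
base_momentum = 0.99   # what you would use without gradient accumulation

# Heuristic: keep the same decay per accumulated batch, i.e. choose
# momentum such that momentum ** accum_steps == base_momentum.
adjusted_momentum = base_momentum ** (1.0 / accum_steps)  # ~0.9975 here

bn = tf.keras.layers.BatchNormalization(momentum=adjusted_momentum)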
But it would surely be better to be able to accumulate the parameters in BN, similarly to what is done for the gradients in GA, instead of playing around with the BN hyperparameters.
EDIT: Also, this study might be a good read for anyone interested in this topic: https://arxiv.org/pdf/2110.12484.pdf
They also propose a modification to BN to work better with GA.
It can also be mentioned that this implementation for synchronization of BN exists: https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/SyncBatchNormalization
It is primarily made for multi-GPU training, but I guess it could also be used on a single GPU in a GA scenario? Not sure.
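For completeness, usage is just a drop-in layer swap (sketch below); note that under GA on a single GPU it would still update its statistics once per micro-batch, so it does not by itself fix the accumulation issue:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32, 32, 3)),
        tf.keras.layers.Conv2D(16, 3, padding="same"),
        # Normalization statistics are aggregated across all replicas.
        tf.keras.layers.experimental.SyncBatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])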
Just mentioning that I have a stable implementation for gradient accumulation now, as a temporary solution until Keras adds a proper method for it: https://github.com/andreped/GradientAccumulator
- Simply wrap the model to add gradient accumulation support (see here for usage).
- Compatible with mixed precision training.
- As an alternative to BN, I have added support for adaptive gradient clipping (PDF).
- Compatible with both GPUs and TPUs.
What is lacking is multi-GPU support. However, for my use case it is not that critical. My main use case is to artificially increase the batch size using a single GPU.
I came across here looking for gradient accumulation where I will train using: 1) multiple GPUs, 2) FP16, 3) the Functional API.
My specific case is to train large language models (e.g., BART/PEGASUS) without TPUs or 100s of GPUs. In order to match the batch_size = 8000 mentioned in these papers, GA is a must. BART mentions (in Section 5, sentence 1) that training language models with very large batch sizes improves performance, so I second that the Keras fit() function should support this natively.
@meliksahturker Have you tested the tool I mentioned above? I have not added multi-GPU support yet, but all other stuff you mention should work. Can try to add multi-GPU support tomorrow, if you'd like.
At least using a single GPU, you can artificially increase the batch size to whichever size you want. But for such a large batch size, be sure to use the right optimizer.
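(For example, an optimizer designed for very large batches, such as LAMB from TensorFlow Addons, assuming that package is available; this is just one commonly mentioned option, not a recommendation from this thread:)

import tensorflow_addons as tfa

# LAMB was designed for very large batch sizes (e.g. BERT pre-training).
optimizer = tfa.optimizers.LAMB(learning_rate=1e-3)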
But yeah, about time GA was added to Keras.
I have looked into it, but seeing that it does not support multi-GPU setups, I haven't tested your tool.
I thought of trying this, which has an example that seems to work in a multi-GPU setting, but it seems you are aware of its existence and developed your tool despite that. So my question is, have you experienced an issue with fsx950223's implementation?
Moreover, I have seen some mention of gradient accumulation causing issues with the batch normalization layer (especially in the presence of FP16). Have you looked into that, too?
Thanks for developing a tool for GA, btw.
@meliksahturker If you follow the commit history of the tool, you will see that I used the code you mentioned as a baseline. However, I did not reach the same results as regular batch training, which is why I went with a different approach.
In TF 2.2, it became possible to overload the train_step method of the Model class. This enabled me to trivially add GA support and have full control over what happens when mixed precision is added. It is a much simpler solution than re-doing the optimizer itself, which proved to be very challenging.
Adding multi-GPU support should be easy. I know what it takes. I just rarely use multi-GPU setups myself and thus have not had the time to add it. But I know of someone who has been successful in a project, so I will consult with him.
However I have benchmarked my tool and I achieve approximately the same results as regular batch training (as close as it gets with expected deviations due to floating point errors). So I believe it is working. I also run unit tests to check exactly this for each new update, and have run several benchmarks which I intend to make public when I get the time.
BN is not compatible with GA. It requires you to modify how and when it updates, which is not trivial in Keras. However, for some of my use cases it has worked fine to use batch size 8 and accum steps 4, essentially boosting the overall batch size to 32. I have also added support for adaptive gradient clipping as a surrogate for BN in GA, but I have yet to see much benefit from it compared to using BN with GA. It might need some parameter tuning in AGC, which you can do through the tool.
EDIT: It is a fundamental issue with using BN with GA, with or without mixed precision. They are not compatible. What I have observed are NaNs, especially using Adam. I solved this by lowering the learning rate and/or adjusting the epsilon of Adam from 1e-6 to 1e-3 (or something similar). But if you know of someone who has seen other issues with BN with GA, let me know. I'm currently making a benchmark and could add some more experiments, if of interest :)
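For anyone curious what AGC actually does, a simplified per-tensor sketch (the method in the paper, and what the tool uses, operates unit-wise per row/filter; names and defaults here are just illustrative):

import tensorflow as tf

def adaptive_gradient_clip(grads, variables, clip_factor=0.01, eps=1e-3):
    # Scale down any gradient whose norm is large relative to the norm of
    # its parameter tensor (per-tensor simplification of AGC).
    clipped = []
    for g, v in zip(grads, variables):
        max_norm = clip_factor * tf.maximum(tf.norm(v), eps)
        g_norm = tf.norm(g)
        scale = tf.minimum(max_norm / (g_norm + 1e-12), 1.0)
        clipped.append(g * scale)
    return clipped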
This is great insight, especially regarding testing fsx950223's implementation thoroughly!
Since LMs are Transformer-based, which use BN heavily, I think I will skip GA for now. The issue with BN and GA makes it even more crucial for this feature to be added to fit() as a complete and noob-friendly parameter, e.g. grad_accumulation_steps = 4.
Definitely. It should just be an argument to set in Model.fit() or similar. GA is already available in pytorch-lightning, and others have added support for GA in their framework. About time Keras does the same.
But note that making BN compatible with GA is definitely not easy. There were some people in the PyTorch forum who tried, but I have yet to see a working solution. I'm not tempted to try myself, but perhaps I will have to make a go at it soon. I also actively use BN, so I have the same problem as you.
A temporary fix could be to adjust the momentum term in all BN layers, which should make it more robust with GA. If I were you, I would at least try that :) Let me know how it goes. Always happy to contribute!