Understanding the relationship between parallelism and microbatches
Hi all,
In DP-SGD, given a fixed minibatch size, say 256, I was wondering why increasing the number of microbatches hurts performance. In the compute_gradients function of the DPOptimizerClass, it seems that for each microbatch we call the superclass's compute_gradients, i.e. the function from the standard TF optimizer. Please correct me if I am wrong, but I assume this function enables parallel computation of the gradients w.r.t. the samples inside a microbatch. However, when it comes to gradient computation across different microbatches, do we wait for the gradient of one microbatch to finish before starting the next? I know that for each microbatch we need to clip the gradient before summing it with the others, but I don't think that harms parallelism. If it does, why?
In theory, the gradients of different microbatches can be computed in parallel, since they are independent of each other. What, then, is the reason for introducing microbatches? If it is due to some restriction in TensorFlow's implementation, any pointers to related documents would also be highly appreciated.
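To make the structure I'm asking about concrete, here is a NumPy sketch of the sequential per-microbatch clip-and-sum loop (the function name, arguments, and simplifications are mine for illustration, not the actual TF Privacy code):

```python
import numpy as np

def dp_sgd_gradient(per_example_grads, num_microbatches, l2_norm_clip,
                    noise_stddev=0.0, rng=None):
    """Sketch of DP-SGD gradient aggregation: split per-example gradients
    into microbatches, clip each microbatch's mean gradient, sum the
    clipped gradients, optionally add noise, and normalize.

    The loop below is sequential over microbatches, mirroring the
    loop-based structure discussed above."""
    microbatches = np.array_split(per_example_grads, num_microbatches)
    total = np.zeros_like(per_example_grads[0])
    for mb in microbatches:                         # one pass per microbatch
        g = mb.mean(axis=0)                         # mean gradient of this microbatch
        norm = np.linalg.norm(g)
        g = g / max(1.0, norm / l2_norm_clip)       # clip to l2_norm_clip
        total += g                                  # accumulate clipped gradients
    if noise_stddev > 0.0:                          # Gaussian noise for privacy
        rng = rng or np.random.default_rng(0)
        total = total + rng.normal(0.0, noise_stddev * l2_norm_clip,
                                   size=total.shape)
    return total / num_microbatches
```

Note that each loop iteration is independent of the others until the final sum, which is why the question about parallelizing across microbatches arises.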
-
To start computing the gradient of a microbatch, do we wait for the computation of the gradient of the previous microbatch? -> Yes, the default code doesn't seem to be optimized for this; the vectorized version of it is. But in my case the optimizations didn't show an evident improvement in runtime on a Xeon CPU, and I'm not sure why.
-
The clipping can be done in parallel if the microbatches are processed in parallel.
I think you should take a look at the vectorized implementation in this repo; that might help. The vectorized approach may not be optimal, and there is surely more room to improve the parallelized computation.
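For intuition, here is a NumPy sketch of what the vectorized approach does differently: all microbatch gradients are clipped at once with batched array operations instead of a Python loop, which is analogous to replacing the per-microbatch loop with something like tf.vectorized_map (the function name and shapes here are illustrative, not the actual repo code):

```python
import numpy as np

def dp_sgd_gradient_vectorized(per_example_grads, num_microbatches, l2_norm_clip):
    """Vectorized microbatch clipping: reshape the per-example gradients
    into (num_microbatches, microbatch_size, dim), then clip and sum all
    microbatch gradients with array ops, with no Python-level loop."""
    dim = per_example_grads.shape[-1]
    g = per_example_grads.reshape(num_microbatches, -1, dim)
    g = g.mean(axis=1)                               # (num_microbatches, dim)
    norms = np.linalg.norm(g, axis=1, keepdims=True) # per-microbatch L2 norms
    g = g / np.maximum(1.0, norms / l2_norm_clip)    # clip every row at once
    return g.sum(axis=0) / num_microbatches          # aggregate and normalize
```

Because the clipping is expressed as one batched operation, the backend is free to compute all microbatches in parallel; whether that shows up as a wall-clock win depends on the hardware and the op scheduler, which may explain the lack of speedup you saw on CPU.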