What's the difference between GBN and the BN used in distributed frameworks?
I've read your paper, but I don't understand the difference between GBN and the BN used in distributed frameworks. In my understanding, GBN does BN with local data, but distributed frameworks also only do BN with local data. So can you explain the difference?
From what I understood in the paper, they are the same thing. In GBN, you artificially "isolate" parts of the batch when computing the values as if they were on distributed machines, even if you are training on a single system.
@Moxinilian you're right. If you're interested in a more efficient implementation, you could check TF's BatchNormalization layer with the virtual_batch_size param. It reshapes the input and batch-norms it inside the BN layer, instead of making a separate pass for each virtual mini-batch.
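To make the reshape trick concrete, here is a minimal NumPy sketch (not the TF implementation, just an illustration of the idea): the batch is reshaped into virtual sub-batches, and each sub-batch is normalized with its own statistics, exactly as if it lived on a separate machine. The function names and the 2-D input shape are my own assumptions for illustration.

```python
import numpy as np

def ghost_batch_norm(x, virtual_batch_size, eps=1e-5):
    """Normalize each virtual sub-batch with its own mean/variance (GBN idea).

    x: array of shape (N, D); N must be divisible by virtual_batch_size.
    """
    n, d = x.shape
    assert n % virtual_batch_size == 0
    # Reshape to (num_virtual_batches, virtual_batch_size, D) so that each
    # virtual batch gets its own statistics in a single vectorized pass,
    # instead of looping over sub-batches.
    chunks = x.reshape(n // virtual_batch_size, virtual_batch_size, d)
    mean = chunks.mean(axis=1, keepdims=True)
    var = chunks.var(axis=1, keepdims=True)
    out = (chunks - mean) / np.sqrt(var + eps)
    return out.reshape(n, d)

def batch_norm(x, eps=1e-5):
    """Plain BN over the whole batch, for comparison."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)
```

With `virtual_batch_size` equal to the full batch size, `ghost_batch_norm` reduces to plain `batch_norm`; with a smaller value, each sub-batch is normalized in isolation, which is the "artificial distribution" described above.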