darts
It seems the global moving average is used for batch normalization, as opposed to what the original paper instructs.
According to the original paper:
- (In architecture search) we always use batch-specific statistics for batch normalization rather than the global moving average.
- Learnable affine parameters in all batch normalizations are disabled during the search process
The second statement holds true in this implementation, but I couldn't find any code implementing the first. From what I can observe, all batch normalization layers use the global moving average.
For example, one batch norm layer has this form:

```
BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=False)
```
I think the momentum (which acts as a decay factor here) should be set to 1 in the above to be consistent with the paper, so that the running statistics always equal those of the most recent batch.
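To illustrate the suggestion, here is a minimal torch-free sketch (not code from this repo) of PyTorch's running-statistics update rule, `running = (1 - momentum) * running + momentum * batch_stat`. With `momentum=1`, the running estimate always equals the most recent batch statistic; with the default `momentum=0.1`, it is a long-horizon global moving average.

```python
def update_running_mean(running, batch_mean, momentum):
    """PyTorch-style exponential moving average update for BN running stats."""
    return (1.0 - momentum) * running + momentum * batch_mean

# momentum=1 -> running mean always equals the latest batch's mean
running = 0.0
for batch_mean in [2.0, 4.0, 8.0]:
    running = update_running_mean(running, batch_mean, momentum=1.0)
assert running == 8.0

# default momentum=0.1 -> a global moving average over many batches
running = 0.0
for batch_mean in [2.0, 4.0, 8.0]:
    running = update_running_mean(running, batch_mean, momentum=0.1)
assert abs(running - 1.322) < 1e-9
```

Note that even with `momentum=1`, eval mode would still reuse the stats of the last training batch; constructing the layer with `track_running_stats=False` is another way in PyTorch to force batch-specific statistics at all times.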