darts
It seems the global moving average is used for batch normalization, as opposed to what the original paper instructs.
According to the original paper:
- (In architecture search) we always use batch-specific statistics for batch normalization rather than the global moving average.
- Learnable affine parameters in all batch normalizations are disabled during the search process
The second statement holds true in this implementation, but I couldn't find any code implementing the first. From what I can observe, all batch normalization layers use the global moving average.
For example, one batch norm layer has this form:

```
BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=False)
```
I think the momentum (which acts as a decay factor here) should be set to 1 in the above to be consistent with the paper, so that the running statistics always equal those of the most recent batch.
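To illustrate the suggestion, here is a minimal torch-free sketch (not code from this repo) of PyTorch's running-statistics update rule, `running = (1 - momentum) * running + momentum * batch_stat`. With `momentum=1`, the running estimate always equals the most recent batch statistic; with the default `momentum=0.1`, it is a long-horizon global moving average.

```python
def update_running_mean(running, batch_mean, momentum):
    """PyTorch-style exponential moving average update for BN running stats."""
    return (1.0 - momentum) * running + momentum * batch_mean

# momentum=1 -> running mean always equals the latest batch's mean
running = 0.0
for batch_mean in [2.0, 4.0, 8.0]:
    running = update_running_mean(running, batch_mean, momentum=1.0)
assert running == 8.0

# default momentum=0.1 -> a global moving average over many batches
running = 0.0
for batch_mean in [2.0, 4.0, 8.0]:
    running = update_running_mean(running, batch_mean, momentum=0.1)
assert abs(running - 1.322) < 1e-9
```

Note that even with `momentum=1`, eval mode would still reuse the stats of the last training batch; constructing the layer with `track_running_stats=False` is another way in PyTorch to force batch-specific statistics at all times.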