Batch size ablation results
Hello, thanks for your great work. Could you provide additional ablations with different batch sizes (e.g., a smaller batch size of 512 or 256 instead of the 1024 reported in the paper)? When I vary the training batch size, I find that the final results vary a lot.
Hi @Alxead. In our experience, a "sqrt" scaling rule should be used to adjust the learning rate. In our default setting, the effective learning rate for batch size 1024 is 1024 / 256 * 1 = 4. With sqrt scaling, the effective learning rate for batch size 512 should be 4 * sqrt(512 / 1024) = 2.828. Since the train script scales the base learning rate linearly by batch_size / 256, this corresponds to 2.828 * 256 / 512 = 1.414, so we can run the train script with '--base-lr 1.414' to achieve this.
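For anyone adapting this to other batch sizes, here is a minimal sketch of that calculation, assuming the train script scales the effective learning rate linearly as base_lr * batch_size / 256 (as implied by the default setting above); the helper name `base_lr_for` is just for illustration:

```python
import math

REF_BATCH = 256          # reference batch size the script scales from
DEFAULT_BATCH = 1024     # default batch size reported in the paper
DEFAULT_BASE_LR = 1.0    # default --base-lr

def base_lr_for(new_batch):
    """Return the --base-lr giving a sqrt-scaled effective lr for new_batch."""
    # Effective lr in the default setting: 1024 / 256 * 1 = 4
    default_eff = DEFAULT_BASE_LR * DEFAULT_BATCH / REF_BATCH
    # Sqrt scaling: shrink the effective lr by sqrt(new_batch / default_batch)
    target_eff = default_eff * math.sqrt(new_batch / DEFAULT_BATCH)
    # Invert the script's linear rule (effective lr = base_lr * batch / 256)
    return target_eff * REF_BATCH / new_batch

print(base_lr_for(512))  # ~1.414 -> pass '--base-lr 1.414'
print(base_lr_for(256))  # ~2.0   -> pass '--base-lr 2.0'
```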
Hi, thank you for your contribution. I was wondering whether you used learning rate decay, since the learning rate is quite high and should decrease as the network converges. Thanks, Ram