Rene Bidart

4 comments by Rene Bidart

Has anyone been able to reproduce the paper's results using Performer on Pathfinder? My accuracy is much worse (62% vs. 77% in the paper). I was able to approximately reproduce the Transformer and BigBird results.

I found that either lowering the learning rate or increasing the batch size helped on this task. I think their hyperparameters are tuned for a large effective batch size because they...
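For anyone adjusting these, here is a minimal sketch of the arithmetic, assuming the common linear scaling heuristic (scale the learning rate in proportion to the effective batch size). That heuristic, the function names, and all the numbers below are my own illustration, not taken from the paper or its configs.

```python
def effective_batch_size(per_device_batch, num_devices, grad_accum_steps=1):
    """Number of examples that contribute to a single optimizer step."""
    return per_device_batch * num_devices * grad_accum_steps


def scaled_learning_rate(reference_lr, reference_batch, actual_batch):
    """Linear scaling heuristic: shrink the learning rate with the batch size."""
    return reference_lr * actual_batch / reference_batch


# Hypothetical reference setup: batch 256 at lr 0.05.
# Hypothetical local setup: one device, batch 8, 4 gradient accumulation steps.
local_batch = effective_batch_size(per_device_batch=8, num_devices=1, grad_accum_steps=4)
local_lr = scaled_learning_rate(reference_lr=0.05, reference_batch=256, actual_batch=local_batch)

print(local_batch)
print(local_lr)
```

This is only a starting point for a sweep. Whether linear scaling is the right rule for this task is something I have not checked.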

CIFAR images are only 32x32 (compared to 224x224 for ImageNet), so you need to reduce the stride of the first few layers, or else the model will perform poorly.
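To make this concrete, here is a small sketch that tracks the spatial size through the first layers using the standard convolution output-size formula. The ImageNet-style stem below (7x7 conv with stride 2, then 3x3 max-pool with stride 2) follows the usual ResNet layout. The CIFAR stem is a hypothetical replacement, not necessarily what this repo uses.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a conv or pooling layer (floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1


def stem_output(size, layers):
    """Apply a list of (kernel, stride, padding) layers to a square input."""
    for kernel, stride, padding in layers:
        size = conv_out_size(size, kernel, stride, padding)
    return size


# ResNet-style ImageNet stem: 7x7 conv stride 2, then 3x3 max-pool stride 2.
imagenet_stem = [(7, 2, 3), (3, 2, 1)]
# Hypothetical CIFAR stem: a single 3x3 conv with stride 1 and no pooling.
cifar_stem = [(3, 1, 1)]

print(stem_output(224, imagenet_stem))  # 56
print(stem_output(32, imagenet_stem))   # 8
print(stem_output(32, cifar_stem))      # 32
```

With the ImageNet stem, a 32x32 input is already down to 8x8 before the first residual stage, which leaves very little spatial resolution for the later downsampling stages.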

This is because the model downsamples more aggressively (its early layers use a larger stride), so the feature maps are smaller and the activations take up less memory.
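As a rough illustration of how stride relates to activation memory: a layer's output area scales with 1/stride^2, so doubling the stride cuts that layer's activation memory to about a quarter. The sketch below assumes float32 activations and input sizes divisible by the stride. The batch size and channel count are made-up numbers.

```python
def activation_bytes(batch, channels, height, width, stride, bytes_per_value=4):
    """Approximate memory of one layer's output, assuming exact divisibility."""
    out_h = height // stride
    out_w = width // stride
    return batch * channels * out_h * out_w * bytes_per_value


# Hypothetical layer: batch 128, 64 channels, 32x32 input, float32 activations.
for stride in (1, 2, 4):
    mib = activation_bytes(128, 64, 32, 32, stride) / 2**20
    print(f"stride {stride}: {mib:.0f} MiB")
# stride 1: 32 MiB
# stride 2: 8 MiB
# stride 4: 2 MiB
```

This counts only a single layer's output. Activations stored for backprop across all layers add up to much more, but the same scaling applies to every layer after the strided one.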