functional-transformer
Bug? Loss curves show a distinct correlation with batch id.
WandB loss curves (e.g. here) show a sawtooth pattern that is correlated with batch ID.
Batches are randomized, and the pattern persists even with 1-bit gradients, so it doesn't look like an Adam artifact...
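
One way to put a number on "correlated with batch ID" is to group the logged losses by batch and ask how much of the loss variance the batch identity alone accounts for. The snippet below is only a rough sketch, not code from this repo: the function name `variance_explained_by_batch` and the `(batch_id, loss)` record format are hypothetical, and it assumes the per-step batch ID and loss have already been exported from the run.

```python
from collections import defaultdict
from statistics import mean, pvariance


def variance_explained_by_batch(records):
    """Fraction of loss variance explained by batch id (eta squared).

    `records` is an iterable of (batch_id, loss) pairs. This format is an
    assumption for illustration, not something the repo is known to log.
    """
    # Collect every loss observed for each batch id.
    groups = defaultdict(list)
    for batch_id, loss in records:
        groups[batch_id].append(loss)

    all_losses = [loss for losses in groups.values() for loss in losses]
    total_var = pvariance(all_losses)
    if total_var == 0:
        return 0.0

    # Between-group variance: spread of per-batch mean losses around the
    # grand mean, weighted by how often each batch was seen.
    grand_mean = mean(all_losses)
    between_var = sum(
        len(losses) * (mean(losses) - grand_mean) ** 2
        for losses in groups.values()
    ) / len(all_losses)
    return between_var / total_var
```

A value near 1 would mean the loss is almost entirely determined by which batch was drawn (some batches are simply harder than others), while a value near 0 would mean batch ID carries little information. Because the loss also trends downward over training, subtracting a running average before grouping would likely give a cleaner signal.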