Less Wright
I think this is also quite interesting. I'd recommend building the hybrid model implementation for a better speed/accuracy tradeoff: > "we found that adding self-attention sublayers to FNet models offers...
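For context, here is a minimal PyTorch sketch of that hybrid idea: Fourier-mixing sublayers in the lower blocks and self-attention in the last few blocks. The dims, depth, and number of attention blocks are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn


class FourierMixing(nn.Module):
    # FNet-style token mixing: 2D FFT over (sequence, hidden), keep the real part
    def forward(self, x):
        return torch.fft.fft2(x).real


class EncoderBlock(nn.Module):
    def __init__(self, dim, n_heads, use_attention):
        super().__init__()
        self.use_attention = use_attention
        if use_attention:
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        else:
            self.fourier = FourierMixing()
        self.mixer_norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.ff_norm = nn.LayerNorm(dim)

    def forward(self, x):
        mixed = self.attn(x, x, x)[0] if self.use_attention else self.fourier(x)
        x = self.mixer_norm(x + mixed)
        x = self.ff_norm(x + self.ff(x))
        return x


class HybridFNet(nn.Module):
    # first (depth - n_attention) blocks mix with Fourier transforms,
    # the final n_attention blocks use self-attention
    def __init__(self, dim=256, depth=8, n_attention=2, n_heads=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            EncoderBlock(dim, n_heads, use_attention=(i >= depth - n_attention))
            for i in range(depth)
        )

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x
```

Putting the attention sublayers at the top of the stack (via the `use_attention` flag) follows my reading of the hybrid variants in the paper, but treat the exact placement and counts as assumptions.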
Thanks @zhanghang1989 - I'm going to run it in eager mode for now and may see if I can take a crack at the JIT aspect next week. Congrats btw...
Hi @zhanghang1989 - good news in that @rwightman was able to make the fix to support JIT with ResNeSt, and I've been able to export as JIT and run...
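For anyone following along, the export itself is only a few lines once scripting works. This is just a sketch using the timm ResNeSt port; the model name and file path are illustrative.

```python
import torch
import timm

model = timm.create_model('resnest50d', pretrained=True)
model.eval()

# script and save the model now that JIT is supported
scripted = torch.jit.script(model)
scripted.save('resnest50d_scripted.pt')

# quick sanity check: scripted output should match eager output
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    assert torch.allclose(model(x), scripted(x), atol=1e-5)
```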
I was able to do some surgery to rework the MobileViT classifier to be trainable within a Jupyter notebook and customize the fc to match the original ImageNet setup. It is doing...
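A rough sketch of that kind of head surgery, shown here against the timm port of MobileViT rather than the original ml-cvnets code (the model name and API calls below are assumptions for illustration).

```python
import timm

# fresh 1000-class fc to match the original ImageNet setup
model = timm.create_model('mobilevit_s', pretrained=False, num_classes=1000)

# or, on an already-built model, swap the classifier head in place
model.reset_classifier(num_classes=1000)
```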
After some fiddling, I've finally got one up and training! (lr = 3e-3, ViT-style weight init, AdamW optimizer). Whether this is optimal is hard to say, but compared to...
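For reference, that recipe sketched out: ViT-style (truncated-normal) weight init plus AdamW at lr = 3e-3. The stand-in model and the weight-decay value are assumptions for illustration.

```python
import torch
import torch.nn as nn


def vit_style_init(module):
    # ViT-style init: trunc-normal Linear weights, zero biases, unit LayerNorm scale
    if isinstance(module, nn.Linear):
        nn.init.trunc_normal_(module.weight, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.LayerNorm):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)


model = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 10))  # stand-in model
model.apply(vit_style_init)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3, weight_decay=0.05)
```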
I overlooked this (that they add a global average pooling layer), so that's a key difference, and also that they had published their params (AdamW, 3e-1 learning rate, etc.). Will add...
nvm, global avg pooling is already in the impl now with this line: 
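The line itself is cut off above; purely for context, a global average pool in this kind of impl is typically just a mean over the token dimension (shapes below are illustrative, not the actual line).

```python
import torch

x = torch.randn(8, 197, 768)   # (batch, tokens, dim)
pooled = x.mean(dim=1)         # -> (8, 768), fed to the classifier head
```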
ok, adding in their warmup, I'm seeing pretty good results (slow but steady, but that's typical of transformers). I'm adding in gradient clipping now as a final test. Here's latest...
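A minimal sketch of the warmup step, assuming a simple linear LR warmup; the warmup length and stand-in model are placeholders, not their published values.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)        # stand-in model
warmup_steps = 10_000            # assumed warmup length

def warmup_lambda(step):
    # scale the base lr linearly from ~0 up to 1.0 over warmup_steps
    return min(1.0, (step + 1) / warmup_steps)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_lambda)
# call scheduler.step() once per training step, after optimizer.step()
```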
I tested both with gradient clipping as in the paper and with adaptive gradient clipping. Results were nearly identical in terms of validation loss (technically hard clipping at 1.0 as...
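The two variants compared, sketched below: standard hard clipping at a max norm of 1.0, and a simplified per-tensor version of adaptive gradient clipping (the NFNets paper clips unit-wise; the clip factor here is an assumption).

```python
import torch
import torch.nn as nn

def adaptive_grad_clip(parameters, clip_factor=0.01, eps=1e-3):
    # per-tensor AGC sketch: cap each grad norm at a fraction of its parameter norm
    for p in parameters:
        if p.grad is None:
            continue
        param_norm = p.detach().norm().clamp_(min=eps)
        grad_norm = p.grad.detach().norm()
        max_norm = param_norm * clip_factor
        if grad_norm > max_norm:
            p.grad.detach().mul_(max_norm / (grad_norm + 1e-6))

model = nn.Linear(16, 16)        # stand-in model
loss = model(torch.randn(4, 16)).pow(2).mean()
loss.backward()

# option 1: hard clipping at 1.0, as in the paper
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# option 2: adaptive gradient clipping instead
# adaptive_grad_clip(model.parameters(), clip_factor=0.01)
```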
So MADGRAD blew away all my previous results, nearly an 18% improvement for the same limited run time (22 epochs). My friend also tested on tabular data and had similar...
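Swapping the optimizer is basically a one-line change with the madgrad package (pip install madgrad); the learning rate below is illustrative, since MADGRAD usually wants a different lr than AdamW.

```python
import torch.nn as nn
from madgrad import MADGRAD

model = nn.Linear(16, 16)   # stand-in model
optimizer = MADGRAD(model.parameters(), lr=1e-2, momentum=0.9, weight_decay=0.0)
```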