Less Wright
I think this is also quite interesting. I'd recommend building the hybrid model implementation for a better speed/accuracy tradeoff: > "we found that adding self-attention sublayers to FNet models offers...
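For context, here is a minimal PyTorch sketch of that hybrid idea: Fourier-mixing sublayers in the lower blocks and self-attention in the last few blocks. The dims, depth, and number of attention blocks are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn


class FourierMixing(nn.Module):
    # FNet-style token mixing: 2D FFT over (sequence, hidden), keep the real part
    def forward(self, x):
        return torch.fft.fft2(x).real


class EncoderBlock(nn.Module):
    def __init__(self, dim, n_heads, use_attention):
        super().__init__()
        self.use_attention = use_attention
        if use_attention:
            self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        else:
            self.fourier = FourierMixing()
        self.mixer_norm = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.ff_norm = nn.LayerNorm(dim)

    def forward(self, x):
        mixed = self.attn(x, x, x)[0] if self.use_attention else self.fourier(x)
        x = self.mixer_norm(x + mixed)
        x = self.ff_norm(x + self.ff(x))
        return x


class HybridFNet(nn.Module):
    # first (depth - n_attention) blocks mix with Fourier transforms,
    # the final n_attention blocks use self-attention
    def __init__(self, dim=256, depth=8, n_attention=2, n_heads=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            EncoderBlock(dim, n_heads, use_attention=(i >= depth - n_attention))
            for i in range(depth)
        )

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x
```

Putting the attention sublayers at the top of the stack (via the `use_attention` flag) follows my reading of the hybrid variants in the paper, but treat the exact placement and counts as assumptions.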
Thanks @zhanghang1989 - I'm going to run it in eager mode for now and may see if I can take a crack at the JIT aspect next week. Congrats btw...
Hi @zhanghang1989 - good news in that @rwightman was able to make the fix to support JIT with ResNeSt, and I've been able to export as JIT and run...
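For anyone following along, the export itself is only a few lines once scripting works. This is just a sketch using the timm ResNeSt port; the model name and file path are illustrative.

```python
import torch
import timm

model = timm.create_model('resnest50d', pretrained=True)
model.eval()

# script and save the model now that JIT is supported
scripted = torch.jit.script(model)
scripted.save('resnest50d_scripted.pt')

# quick sanity check: scripted output should match eager output
x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    assert torch.allclose(model(x), scripted(x), atol=1e-5)
```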
I was able to do some surgery to rework the MobileViT classifier to be trainable within a Jupyter notebook and customize the fc to match the original ImageNet setup. It is doing...
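A rough sketch of that kind of head surgery, shown here against the timm port of MobileViT rather than the original ml-cvnets code (the model name and API calls below are assumptions for illustration).

```python
import timm

# fresh 1000-class fc to match the original ImageNet setup
model = timm.create_model('mobilevit_s', pretrained=False, num_classes=1000)

# or, on an already-built model, swap the classifier head in place
model.reset_classifier(num_classes=1000)
```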
After some fiddling, I've finally got one up and training! (lr = 3e-3, ViT-style weight init, AdamW optimizer). Whether this is optimal is hard to say, but compared to...
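For reference, that recipe sketched out: ViT-style (truncated-normal) weight init plus AdamW at lr = 3e-3. The stand-in model and the weight-decay value are assumptions for illustration.

```python
import torch
import torch.nn as nn


def vit_style_init(module):
    # ViT-style init: trunc-normal Linear weights, zero biases, unit LayerNorm scale
    if isinstance(module, nn.Linear):
        nn.init.trunc_normal_(module.weight, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
    elif isinstance(module, nn.LayerNorm):
        nn.init.ones_(module.weight)
        nn.init.zeros_(module.bias)


model = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 10))  # stand-in model
model.apply(vit_style_init)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3, weight_decay=0.05)
```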
I overlooked this (that they add a global average pooling layer), so that's a key difference, and also that they had published their params (AdamW, 3e-1 learning rate, etc.). Will add...
nvm, global avg pooling is already in the impl now with this line: 
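The line itself is cut off above; purely for context, a global average pool in this kind of impl is typically just a mean over the token dimension (shapes below are illustrative, not the actual line).

```python
import torch

x = torch.randn(8, 197, 768)   # (batch, tokens, dim)
pooled = x.mean(dim=1)         # -> (8, 768), fed to the classifier head
```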
ok, adding in their warmup, I'm seeing pretty good results (slow but steady, but that's typical of transformers). I'm adding in gradient clipping now as a final test. Here's latest...
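A minimal sketch of the warmup step, assuming a simple linear LR warmup; the warmup length and stand-in model are placeholders, not their published values.

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)        # stand-in model
warmup_steps = 10_000            # assumed warmup length

def warmup_lambda(step):
    # scale the base lr linearly from ~0 up to 1.0 over warmup_steps
    return min(1.0, (step + 1) / warmup_steps)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_lambda)
# call scheduler.step() once per training step, after optimizer.step()
```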
I tested both with gradient clipping as in the paper and with adaptive gradient clipping. Results were nearly identical in terms of validation loss (technically hard clipping at 1.0 as...
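The two variants compared, sketched below: standard hard clipping at a max norm of 1.0, and a simplified per-tensor version of adaptive gradient clipping (the NFNets paper clips unit-wise; the clip factor here is an assumption).

```python
import torch
import torch.nn as nn

def adaptive_grad_clip(parameters, clip_factor=0.01, eps=1e-3):
    # per-tensor AGC sketch: cap each grad norm at a fraction of its parameter norm
    for p in parameters:
        if p.grad is None:
            continue
        param_norm = p.detach().norm().clamp_(min=eps)
        grad_norm = p.grad.detach().norm()
        max_norm = param_norm * clip_factor
        if grad_norm > max_norm:
            p.grad.detach().mul_(max_norm / (grad_norm + 1e-6))

model = nn.Linear(16, 16)        # stand-in model
loss = model(torch.randn(4, 16)).pow(2).mean()
loss.backward()

# option 1: hard clipping at 1.0, as in the paper
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# option 2: adaptive gradient clipping instead
# adaptive_grad_clip(model.parameters(), clip_factor=0.01)
```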
So MADGRAD blew away all my previous results, nearly an 18% improvement for the same limited run time (22 epochs). My friend also tested on tabular data and had similar...
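Swapping the optimizer is basically a one-line change with the madgrad package (pip install madgrad); the learning rate below is illustrative, since MADGRAD usually wants a different lr than AdamW.

```python
import torch.nn as nn
from madgrad import MADGRAD

model = nn.Linear(16, 16)   # stand-in model
optimizer = MADGRAD(model.parameters(), lr=1e-2, momentum=0.9, weight_decay=0.0)
```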