Less Wright
Running madgrad with AdamW-style decay, using a similar decay value as for AdamW, has so far produced the best results (slightly better accuracy and loss vs no weight decay or...
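For anyone unfamiliar with the distinction, here's a minimal sketch of what "AdamW-style" (decoupled) decay means, independent of madgrad's actual API: the decay shrinks the weights directly instead of being folded into the gradient.

```python
import torch

@torch.no_grad()
def apply_decoupled_decay(params, lr: float, wd: float) -> None:
    # AdamW-style (decoupled) decay: shrink each weight directly by
    # lr * wd, rather than adding wd * p to the gradient (classic L2),
    # so the decay is not rescaled by the optimizer's adaptive terms.
    for p in params:
        p.mul_(1.0 - lr * wd)
```

You'd call this once per training step alongside `optimizer.step()`, with the optimizer's own `weight_decay` set to 0.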
Hi @yueming-zhang, did you use any type of lr warmup and schedule? I used a linear warmup, as the Swin authors did per their paper, and then a cosine...
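As a rough illustration (not the exact schedule from that run), linear warmup into cosine decay can be composed from PyTorch's built-in schedulers; the step counts and lr below are placeholder values:

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(10, 10)                      # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps, total_steps = 500, 10_000              # assumed values
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        # ramp lr linearly from 1% of base up to base over warmup_steps
        LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps),
        # then cosine-decay over the remaining steps
        CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps),
    ],
    milestones=[warmup_steps],
)
# call scheduler.step() once per training step
```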
Let me review - I was not even aware this PR existed until today, so thanks for the direct link.
General comment - this example does not use activation checkpointing because of the timing of this PR (it wasn't added to FSDP until after this PR was opened). But I think it...
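If you wanted to retrofit it, something along these lines should work with the checkpoint wrapper utilities in recent PyTorch versions; `TransformerBlock` and `model` here are stand-ins for whatever is actually being trained:

```python
from functools import partial

import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)

class TransformerBlock(nn.Module):  # stand-in for the real block class
    def __init__(self):
        super().__init__()
        self.ff = nn.Linear(16, 16)

    def forward(self, x):
        return self.ff(x)

model = nn.Sequential(TransformerBlock(), TransformerBlock())  # stand-in model

# Wrap every TransformerBlock in an activation-checkpoint wrapper;
# the non-reentrant impl is the generally recommended one.
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=partial(
        checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT
    ),
    check_fn=lambda m: isinstance(m, TransformerBlock),
)
```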
A general tip is to use focal loss when training with small objects. I'm not sure it will work here, but it certainly helps in general.
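For reference, a minimal binary focal loss looks like this (alpha and gamma are the usual defaults from the paper, not tuned values):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha: float = 0.25, gamma: float = 2.0):
    # Binary focal loss (Lin et al., 2017): scale the cross-entropy by
    # (1 - p_t)^gamma so well-classified (easy) examples contribute
    # little, letting rare / small-object examples dominate the gradient.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-ce)  # model's probability for the true class
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()
```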
I'm testing some other loss functions tomorrow and will let you know if there's any progress. The boundary loss penalty looks really promising.
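For context, the core of the boundary loss is tiny; the real work is in precomputing the signed distance maps, which this sketch assumes you already have:

```python
def boundary_loss(probs, dist_maps):
    # Boundary loss (Kervadec et al., 2019): weight the predicted
    # foreground probabilities by a precomputed signed distance map of
    # the ground-truth boundary, so errors far from the true boundary
    # are penalized more heavily.
    return (probs * dist_maps).mean()
```

In practice it's usually blended with a regional loss like Dice, with the boundary term's weight ramped up over training.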
Training from scratch with a large Vision Transformer (500M parameters) worked, so this issue seems to be specific to the NLP embeddings. I'll try to isolate the embedding layer and keep...
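One cheap way to isolate it is to track how far the embedding weights drift from the pretrained checkpoint relative to everything else; the "embed" name filter below is an assumption about the parameter names and may need adjusting:

```python
import torch

@torch.no_grad()
def embedding_drift(ref_state_dict, model):
    # Compare current embedding weights against the pretrained
    # reference to see whether the embedding layer is the component
    # that diverges under low-precision training.
    return {
        name: (p.float() - ref_state_dict[name].float()).norm().item()
        for name, p in model.named_parameters()
        if "embed" in name  # assumption: embedding params contain "embed"
    }
```

with `ref_state_dict` snapshotted from the pretrained checkpoint before training starts.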
Quick update - I see very similar behaviour on T5 if you run with BF16 and stochastic rounding, so it seems the embeddings of an already-trained T5 are super sensitive...
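For anyone wanting to reproduce this, a common way to implement FP32 -> BF16 stochastic rounding is the bit trick below (a sketch that ignores overflow at the extremes of the format):

```python
import torch

def stochastic_round_to_bf16(x: torch.Tensor) -> torch.Tensor:
    # Stochastically round FP32 -> BF16: add uniform noise to the 16
    # low bits that truncation would discard, then truncate. Each value
    # rounds up or down with probability proportional to its distance
    # from the two nearest representable BF16 neighbors.
    assert x.dtype == torch.float32
    noise = torch.randint(0, 1 << 16, x.shape, dtype=torch.int32, device=x.device)
    bits = x.view(torch.int32).add(noise).bitwise_and_(-65536)  # zero low 16 bits
    return bits.view(torch.float32).to(torch.bfloat16)  # now an exact conversion
```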
As a general concept, .so (shared object) files are the equivalent of .dll (dynamic-link library) files on Windows, but .so is for Linux and Android. Thus, I believe it would...
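Concretely, from Python both are loaded the same way via ctypes; `libfoo.so` / `foo.dll` are hypothetical names used purely for illustration:

```python
import ctypes

# On Linux/Android the shared library is a .so file; on Windows the
# same role is played by a .dll.
lib = ctypes.CDLL("./libfoo.so")   # Linux / Android
# lib = ctypes.CDLL("foo.dll")     # Windows equivalent
```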
Note - you may want to just run under Windows Subsystem for Linux (WSL), and then you should be able to run as expected. Alternatively, this is a CUDA-related...