Less Wright
Running madgrad with AdamW-style decay, using a similar decay value as for AdamW, has so far produced the best results (slightly better accuracy and loss vs no weight decay or...
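For anyone unfamiliar with the distinction, here's a minimal sketch of what "AdamW-style" (decoupled) decay means, independent of madgrad's actual API: the decay shrinks the weights directly instead of being folded into the gradient.

```python
import torch

@torch.no_grad()
def apply_decoupled_decay(params, lr: float, wd: float) -> None:
    # AdamW-style (decoupled) decay: shrink each weight directly by
    # lr * wd, rather than adding wd * p to the gradient (classic L2),
    # so the decay is not rescaled by the optimizer's adaptive terms.
    for p in params:
        p.mul_(1.0 - lr * wd)
```

You'd call this once per training step alongside `optimizer.step()`, with the optimizer's own `weight_decay` set to 0.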
Hi @yueming-zhang, did you use any type of lr warmup and schedule? I used a linear warmup, as the Swin authors did per their paper, and then a cosine...
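As a rough illustration (not the exact schedule from that run), linear warmup into cosine decay can be composed from PyTorch's built-in schedulers; the step counts and lr below are placeholder values:

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(10, 10)                      # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps, total_steps = 500, 10_000              # assumed values
scheduler = SequentialLR(
    optimizer,
    schedulers=[
        # ramp lr linearly from 1% of base up to base over warmup_steps
        LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps),
        # then cosine-decay over the remaining steps
        CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps),
    ],
    milestones=[warmup_steps],
)
# call scheduler.step() once per training step
```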
Let me review - I was not even aware this PR existed until today, so thanks for the direct link.
General comment - this example does not use activation checkpointing because of the timing of this PR (it wasn't added to FSDP until after this PR was opened). But I think it...
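If you wanted to retrofit it, something along these lines should work with the checkpoint wrapper utilities in recent PyTorch versions; `TransformerBlock` and `model` here are stand-ins for whatever is actually being trained:

```python
from functools import partial

import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    CheckpointImpl,
    apply_activation_checkpointing,
    checkpoint_wrapper,
)

class TransformerBlock(nn.Module):  # stand-in for the real block class
    def __init__(self):
        super().__init__()
        self.ff = nn.Linear(16, 16)

    def forward(self, x):
        return self.ff(x)

model = nn.Sequential(TransformerBlock(), TransformerBlock())  # stand-in model

# Wrap every TransformerBlock in an activation-checkpoint wrapper;
# the non-reentrant impl is the generally recommended one.
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=partial(
        checkpoint_wrapper, checkpoint_impl=CheckpointImpl.NO_REENTRANT
    ),
    check_fn=lambda m: isinstance(m, TransformerBlock),
)
```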
A general tip is to use focal loss when training with small objects. I'm not sure it will work here, but it certainly helps in general.
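For reference, a minimal binary focal loss looks like this (alpha and gamma are the usual defaults from the paper, not tuned values):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha: float = 0.25, gamma: float = 2.0):
    # Binary focal loss (Lin et al., 2017): scale the cross-entropy by
    # (1 - p_t)^gamma so well-classified (easy) examples contribute
    # little, letting rare / small-object examples dominate the gradient.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-ce)  # model's probability for the true class
    return (alpha * (1.0 - p_t) ** gamma * ce).mean()
```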
I'm testing some other loss functions tomorrow and will let you know if there's any progress. The boundary loss penalty looks really promising.
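For context, the core of the boundary loss is tiny; the real work is in precomputing the signed distance maps, which this sketch assumes you already have:

```python
def boundary_loss(probs, dist_maps):
    # Boundary loss (Kervadec et al., 2019): weight the predicted
    # foreground probabilities by a precomputed signed distance map of
    # the ground-truth boundary, so errors far from the true boundary
    # are penalized more heavily.
    return (probs * dist_maps).mean()
```

In practice it's usually blended with a regional loss like Dice, with the boundary term's weight ramped up over training.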
Training from scratch with a large Vision Transformer (500M parameters) worked, so this issue seems to be specific to the NLP embeddings. I'll try to isolate the embedding layer and keep...
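One cheap way to isolate it is to track how far the embedding weights drift from the pretrained checkpoint relative to everything else; the "embed" name filter below is an assumption about the parameter names and may need adjusting:

```python
import torch

@torch.no_grad()
def embedding_drift(ref_state_dict, model):
    # Compare current embedding weights against the pretrained
    # reference to see whether the embedding layer is the component
    # that diverges under low-precision training.
    return {
        name: (p.float() - ref_state_dict[name].float()).norm().item()
        for name, p in model.named_parameters()
        if "embed" in name  # assumption: embedding params contain "embed"
    }
```

with `ref_state_dict` snapshotted from the pretrained checkpoint before training starts.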
Quick update - I see very similar behaviour on T5 if you run with BF16 and stochastic rounding, so it seems the embeddings of an already-trained T5 are super sensitive...
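For anyone wanting to reproduce this, a common way to implement FP32 -> BF16 stochastic rounding is the bit trick below (a sketch that ignores overflow at the extremes of the format):

```python
import torch

def stochastic_round_to_bf16(x: torch.Tensor) -> torch.Tensor:
    # Stochastically round FP32 -> BF16: add uniform noise to the 16
    # low bits that truncation would discard, then truncate. Each value
    # rounds up or down with probability proportional to its distance
    # from the two nearest representable BF16 neighbors.
    assert x.dtype == torch.float32
    noise = torch.randint(0, 1 << 16, x.shape, dtype=torch.int32, device=x.device)
    bits = x.view(torch.int32).add(noise).bitwise_and_(-65536)  # zero low 16 bits
    return bits.view(torch.float32).to(torch.bfloat16)  # now an exact conversion
```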
As a general concept, .so (shared object) files are the equivalent of .dll (dynamic-link library) files on Windows, but .so is for Linux and Android. Thus, I believe it would...
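Concretely, from Python both are loaded the same way via ctypes; `libfoo.so` / `foo.dll` are hypothetical names used purely for illustration:

```python
import ctypes

# On Linux/Android the shared library is a .so file; on Windows the
# same role is played by a .dll.
lib = ctypes.CDLL("./libfoo.so")   # Linux / Android
# lib = ctypes.CDLL("foo.dll")     # Windows equivalent
```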
Note - you may want to just run under Windows Subsystem for Linux (WSL), and then you should be able to run as expected. Alternatively, this is a CUDA-related...