Mitchell Wortsman
That is a good idea; no, we have not thought about this! It is difficult, as supermasks that do the same thing could look very different.
Sorry, I mean `bigG`, not `g`.
Seems like progress is being made with FSDP, and we also think the OOM was because of model size + activations.
Hey Aditya, thanks for the PR with MRL -- however, if you want to make MRL an option, it would be good to have a flag so that this PR...
Sure, can you convert to draft in the meantime?
Yeah, totally agree. While I'll likely keep using this for my existing run, I like your implementation better for the repo going forward, so I'll close this. Thanks!
Closing, because the hypothesis is that it relates to a filesystem issue, which should not affect most users.
Here's one hypothesis for what's going on: look at the graphs for `logit_scale` and `samples/s` towards the end of training -- the dips in `logit_scale` occur towards the end of...
Hi Adam, this looks great! I don't have access to this repo anymore because I'm no longer on the internship, but let's keep this issue open so that other people can...
And very nice paper -- thanks for sharing!