Dirk Groeneveld
@epwalsh, will the fused CE loss work on LUMI? We have to be careful in case that same numerical problem shows up there. I guess the new approach is...
OLMo is now properly integrated with Transformers!
Please file a PR! In the code, they should all be "OlmoSomething".
Ok, then we'll go with Huggingface's suggestion. That means we rename everything to `OLMo*`, right?
Can you add a note to the Changelog? Then we're good to go.
Coming late to this discussion. Are you loading optimizer state from somewhere? If you are not, you should warm up your learning rate from 0 over a number of steps.
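For illustration, a linear warmup from 0 could look roughly like the sketch below. The function name `warmup_lr` and the values for `peak_lr` and `warmup_steps` are made up for this example and are not taken from the OLMo code; the point is only that the learning rate starts at 0 and reaches its full value after a fixed number of steps.

```python
def warmup_lr(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Scale the learning rate linearly from 0 at step 0 up to peak_lr.

    All names and values here are hypothetical, chosen for illustration.
    """
    if warmup_steps <= 0:
        # No warmup requested: use the full learning rate immediately.
        return peak_lr
    # Clamp the step so the rate stays at peak_lr once warmup is over.
    return peak_lr * min(step, warmup_steps) / warmup_steps


# Example: the learning rate for the first few steps of a 4-step warmup.
schedule = [warmup_lr(s, peak_lr=3e-4, warmup_steps=4) for s in range(6)]
```

In a real training run the same scaling would usually be applied through the optimizer's scheduler rather than computed by hand.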
I do this on a Mac. I think Linux has a higher default open file limit, so it takes a lot more to hit the same problem.
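If you want to compare the limits on your own machines, something like this should print them. It is a small sketch using Python's standard `resource` module, which is only available on Unix-like systems (macOS and Linux included); the helper name `open_file_limits` is made up for this example.

```python
import resource


def open_file_limits() -> tuple[int, int]:
    """Return the (soft, hard) limit on open file descriptors for this process."""
    # RLIMIT_NOFILE is the per-process cap on open file descriptors.
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = open_file_limits()
print(f"soft limit: {soft}, hard limit: {hard}")
```

The soft limit is the one a process actually runs into; it can be raised up to the hard limit without extra privileges.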
I realize (now) that `step_result()` is not the correct method to call. But the behavior is quite pathological.
What do you mean by "find the right parameter for init"? What's the parameter you are missing?