jiant How to start training from a checkpoint or snapshot? [question] [documentation]

I saved a checkpoint.p at the end of my most recent training epoch. I also have the last_model.p. Now what? I'd like to throw a few more epochs at it. But so far I can only manage either to get error messages or to get a run that starts by grabbing itself a fresh new roberta-base.

If nothing else, I hoped I could pass last_model.p to hf_pretrained_model_name_or_path. I would have lost extra optimizer-related goodness lurking inside the proper checkpoint, but at least I could take advantage of the improved weights. Unfortunately, this doesn't seem to work the way I expected.

Jiant seems to expect to receive a pretrained model within expdir/models/$NAME, with an entourage of config files and tokenizer, not sitting somewhere inside expdir/runs/$WHATEVER surrounded by the detritus of another run. Extant documentation does not make it clear whether my saved model needs to be transplanted into a copy of such a setting, or what.

Feb 04 '21 02:02 eritain

Hey, me-from-the-past. I haven't done this yet, but we know some things now.

For one thing, we'll have to call run_with_continue() somehow. If the previous run went to completion, I don't know how Jiant is going to feel about us altering the config under its feet to request more epochs before we resume the run. It tends not to react well to changes.
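To make the worry about "requesting more epochs" concrete, here is a minimal, generic sketch of a resume-or-start loop. It is not Jiant's run_with_continue() and makes no claim about how Jiant stores or checks its state; the function name, the file name toy_checkpoint.pkl, and the state keys are all made up for illustration. In this toy version, raising the epoch target after a finished run simply runs the extra epochs, which is the behavior we would hope for but have not confirmed in Jiant.

```python
import os
import pickle
import tempfile


def train_with_resume(checkpoint_path, total_epochs):
    """Generic resume-or-start loop (hypothetical; NOT jiant's implementation)."""
    if os.path.exists(checkpoint_path):
        # Resume: restore weights, optimizer progress, and the epoch counter.
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)
    else:
        # Fresh start: nothing saved yet.
        state = {"epoch": 0, "weights": 0.0, "optimizer_steps": 0}

    while state["epoch"] < total_epochs:
        # Stand-in for one epoch of real training.
        state["weights"] += 0.5
        state["optimizer_steps"] += 1
        state["epoch"] += 1
        # Save after every epoch so an interrupted run can pick up here.
        with open(checkpoint_path, "wb") as f:
            pickle.dump(state, f)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as run_dir:
        path = os.path.join(run_dir, "toy_checkpoint.pkl")
        first = train_with_resume(path, total_epochs=3)
        # Raise the target after the first run has completed.
        resumed = train_with_resume(path, total_epochs=5)
        print(first["epoch"], resumed["epoch"], resumed["optimizer_steps"])
```

The point of the sketch is only that a resumable run needs the epoch counter and optimizer progress saved alongside the weights, which is what a bare last_model.p would lack.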

For another thing, it's called model_path now. I know it says hf_pretrained_model_name_or_path right in the source code, but passing that argument never gets us anywhere. No, I can't tell you why hf_pretrained_model_name_or_path fails when it is in the source, and model_path succeeds when it isn't. It may be more of that creepy zconf skulduggery.

If you can get a message to us-even-further-in-the-past, tell him not to screw around with the "Simple" API. It makes some of our interests very very slightly easier, but it makes many of our interests impossible. For example, in the Main CLI API you can pass --ZZoverrides to make it accept a different model_path than what's in the model config. If there's a way to do this in the Simple API, it's hidden deeper than we have time and gumption for.
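For the record, a hedged sketch of what such a Main CLI invocation might look like, built as an argv list in Python so nothing actually runs. Only --ZZoverrides, model_path, and run_with_continue come from this thread; the script path, the use of run_with_continue as a subcommand, and the --ZZsrc flag for loading the run config are assumptions from memory and should be checked against the installed Jiant version. Both file paths in the usage lines are hypothetical placeholders.

```python
import shlex


def build_resume_command(run_config_path, weights_path):
    """Assemble an argv list for a resume run; nothing is executed here."""
    return [
        "python",
        "jiant/proj/main/runscript.py",  # assumed location of the main runscript
        "run_with_continue",             # assumed to be accepted as a subcommand
        "--ZZsrc", run_config_path,      # assumed flag: load arguments from a JSON config
        "--ZZoverrides", "model_path",   # let the CLI value win over the config value
        "--model_path", weights_path,    # point at the saved weights instead
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    cmd = build_resume_command("my_run_config.json", "my_saved_weights.p")
    print(shlex.join(cmd))
```

Building the list first and printing it makes it easy to eyeball which values come from the config file and which are being overridden before handing the command to a shell or to subprocess.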

Mar 12 '21 00:03 eritain

If you can get a message to us-even-further-in-the-past, tell him not to screw around with the "Simple" API.

Again, for emphasis. Granted, the division of labor between model config, task config, task container config (a.k.a. run config), and runscript arguments isn't very intelligible on first impression, and it's even less intelligible when you try to use it. We will grep the source code many times while determining where to configure what, and swear to file many issues on that, and also just swear. But us-even-further-in-the-past thought learning the Simple API first would give us a leg up on that complexity; it didn't.

Mar 19 '21 20:03 eritain