Susan Zhang
Right now, if we were to launch a job with --restore-file **and** experience a job restart shortly thereafter (before a new checkpoint is written), we fail to resume again from...
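One way to avoid this would be to always prefer the newest checkpoint already written to the save directory over the `--restore-file` argument on (re)start. A minimal sketch of that resolution logic, assuming a hypothetical `resolve_restore_path` helper and a `checkpoint_<step>.pt` naming scheme (both illustrative, not metaseq's actual API):

```python
import os
import re

def resolve_restore_path(save_dir, restore_file):
    """Prefer the newest checkpoint in save_dir over --restore-file.

    Hypothetical helper: after a job restart, any checkpoint already
    written to save_dir should win over the originally requested
    --restore-file, so the job does not silently re-warm-start from
    the older weights.
    """
    pattern = re.compile(r"checkpoint_(\d+)\.pt")
    best_step, best_path = -1, None
    if os.path.isdir(save_dir):
        for name in os.listdir(save_dir):
            m = pattern.fullmatch(name)
            if m and int(m.group(1)) > best_step:
                best_step = int(m.group(1))
                best_path = os.path.join(save_dir, name)
    # Fall back to --restore-file only if nothing newer exists locally.
    return best_path if best_path is not None else restore_file
```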
Assuming we still keep this sad glue: https://github.com/facebookresearch/metaseq/pull/494, tracking a few more things to address here:

1. For 1k files, a single node going around ssh'ing to each of the 128...
Right now, our JSON logging spits out something like the following:

```
2022-10-31 09:47:44 | INFO | train_inner | {"epoch": 2, "actv_norm": "300.911", "pos_norm": "0.36", "tok_norm": "0.878", "emb_norm": "0.002", "docsperex":...
```
Similar to how the refactor was done for the non-model-parallel version, we should split https://github.com/facebookresearch/metaseq/blob/main/metaseq/model_parallel/modules/transformer_layer.py into two files (encoder vs decoder) before trying to unify the codepaths between model-parallel vs non-model-parallel...
We have a good number of tests floating around in https://github.com/facebookresearch/metaseq/tree/main/tests, but these never run through CircleCI / on each PR. We should move as many of these back into...
There are lots of defaults set for all the `transformer_*` arches - we should remove as many of these as possible and noisily fail if users do not specify these...
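The "noisily fail" behavior could look something like the sketch below; the helper name and the specific flag names are hypothetical, just to illustrate validating args up front instead of silently falling back to a `transformer_*` default:

```python
def require_arch_args(
    args,
    required=("decoder_layers", "decoder_embed_dim", "decoder_attention_heads"),
):
    """Fail loudly if required architecture flags are unset.

    Hypothetical helper: rather than silently applying an arch default,
    collect every missing flag and raise a single descriptive error.
    """
    missing = [name for name in required if getattr(args, name, None) is None]
    if missing:
        raise ValueError(
            f"Missing required architecture flags: {', '.join(missing)}. "
            "Defaults have been removed; please set these explicitly."
        )
```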
As @glample notes: > kwargs appears 260 times in the codebase ! This makes code hard to read / follow if we just throw kwargs around like candy. This is...
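A small before/after sketch of why this hurts readability (function and parameter names here are made up for illustration): with `**kwargs`, the real interface is invisible at the call site and a typo'd keyword is silently swallowed, whereas explicit keyword-only parameters document the signature and raise immediately:

```python
# Before: the caller cannot tell what build_layer actually accepts,
# and a typo'd keyword name is silently ignored.
def build_layer_opaque(dim, **kwargs):
    dropout = kwargs.get("dropout", 0.0)
    num_heads = kwargs.get("num_heads", 8)
    return {"dim": dim, "dropout": dropout, "num_heads": num_heads}

# After: explicit keyword-only parameters document the interface,
# and unexpected arguments raise a TypeError at the call site.
def build_layer_explicit(dim, *, dropout=0.0, num_heads=8):
    return {"dim": dim, "dropout": dropout, "num_heads": num_heads}
```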
In other words, if https://github.com/facebookresearch/metaseq/blob/133d32c04975f94318b6281b3618c9aebf6e8100/metaseq/launcher/opt_job_constants.py#L18 is set to 1, make sure all flags are consistent. Initialization should match too (it currently might not).