Susan Zhang

Results 45 issues of Susan Zhang

Right now, if we were to launch a job with --restore-file **and** experience a job restart shortly thereafter (before a new checkpoint is written), we fail to resume again from...

bug
checkpointing

Assuming we still keep this sad glue: https://github.com/facebookresearch/metaseq/pull/494, tracking a few more things to address here. 1. For 1k files, single node going around ssh'ing to each of the 128...

enhancement
cleanup

Right now, our json logging spits out something like the following: ``` 2022-10-31 09:47:44 | INFO | train_inner | {"epoch": 2, "actv_norm": "300.911", "pos_norm": "0.36", "tok_norm": "0.878", "emb_norm": "0.002", "docsperex":...

telemetry
better-eng

Similar to how the refactor was done for the non-model-parallel version, we should split https://github.com/facebookresearch/metaseq/blob/main/metaseq/model_parallel/modules/transformer_layer.py into two files (encoder vs decoder) before trying to unify the codepaths between model-parallel vs non-model-parallel...

cleanup
better-eng

We have a good number of tests floating around in https://github.com/facebookresearch/metaseq/tree/main/tests but these never run through CircleCI / on each PR. We should move as many of these back into...

cleanup
test-coverage
better-eng

There are lots of defaults set for all the `transformer_*` arches - we should remove as many of these as possible and fail noisily if users do not specify these...

cleanup
config
better-eng
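
To illustrate the "fail noisily" idea in the issue above, here is a minimal sketch, not metaseq's actual config code. The `ArchConfig` class, its field names, and `validate_arch_config` are all hypothetical and chosen only for illustration; the assumption is that an unset value is represented as `None` rather than silently replaced by an arch default.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ArchConfig:
    # Hypothetical architecture settings. None means "the user did not set this".
    decoder_layers: Optional[int] = None
    decoder_embed_dim: Optional[int] = None
    decoder_attention_heads: Optional[int] = None


def validate_arch_config(cfg: ArchConfig) -> None:
    """Raise instead of falling back to a hidden default for any unset field."""
    missing = [f.name for f in fields(cfg) if getattr(cfg, f.name) is None]
    if missing:
        raise ValueError(
            "Missing required architecture settings: " + ", ".join(missing)
        )
```

With this shape, `validate_arch_config(ArchConfig(decoder_layers=12))` raises a `ValueError` naming the two unset fields, so a forgotten setting surfaces at launch time rather than as a silently different model.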

As @glample notes: > kwargs appears 260 times in the codebase ! This makes code hard to read / follow if we just throw kwargs around like candy. This is...

cleanup
better-eng
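
As a rough illustration of why passing `kwargs` around hurts readability, the sketch below contrasts a `**kwargs` signature with explicit keyword-only parameters. Both functions and their parameter names are hypothetical and not taken from the metaseq codebase.

```python
def build_layer_loose(**kwargs):
    # A misspelled key is silently ignored and the fallback value is used.
    return {
        "embed_dim": kwargs.get("embed_dim", 512),
        "dropout": kwargs.get("dropout", 0.1),
    }


def build_layer_explicit(*, embed_dim: int, dropout: float):
    # Keyword-only parameters: the signature documents what is accepted,
    # and an unknown or misspelled name raises TypeError at the call site.
    return {"embed_dim": embed_dim, "dropout": dropout}
```

Calling `build_layer_loose(embed_dmi=1024)` returns the fallback `embed_dim` of 512 without complaint, while `build_layer_explicit(embed_dmi=1024, dropout=0.1)` raises a `TypeError`, which is the kind of early failure an explicit signature buys.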

In other words, if https://github.com/facebookresearch/metaseq/blob/133d32c04975f94318b6281b3618c9aebf6e8100/metaseq/launcher/opt_job_constants.py#L18 is set to 1, make sure all flags are consistent. Initialization should match too (this might not currently be the case).

enhancement
config