Susan Zhang
Right now, if we were to launch a job with --restore-file **and** experience a job restart shortly thereafter (before a new checkpoint is written), we fail to resume again from...
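One way to avoid this would be to always prefer the newest checkpoint already written to the save directory over the `--restore-file` argument on (re)start. A minimal sketch of that resolution logic, assuming a hypothetical `resolve_restore_path` helper and a `checkpoint_<step>.pt` naming scheme (both illustrative, not metaseq's actual API):

```python
import os
import re

def resolve_restore_path(save_dir, restore_file):
    """Prefer the newest checkpoint in save_dir over --restore-file.

    Hypothetical helper: after a job restart, any checkpoint already
    written to save_dir should win over the originally requested
    --restore-file, so the job does not silently re-warm-start from
    the older weights.
    """
    pattern = re.compile(r"checkpoint_(\d+)\.pt")
    best_step, best_path = -1, None
    if os.path.isdir(save_dir):
        for name in os.listdir(save_dir):
            m = pattern.fullmatch(name)
            if m and int(m.group(1)) > best_step:
                best_step = int(m.group(1))
                best_path = os.path.join(save_dir, name)
    # Fall back to --restore-file only if nothing newer exists locally.
    return best_path if best_path is not None else restore_file
```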
Assuming we still keep this sad glue: https://github.com/facebookresearch/metaseq/pull/494, tracking a few more things to address here:

1. For 1k files, a single node going around ssh'ing to each of the 128...
Right now, our JSON logging spits out something like the following:

```
2022-10-31 09:47:44 | INFO | train_inner | {"epoch": 2, "actv_norm": "300.911", "pos_norm": "0.36", "tok_norm": "0.878", "emb_norm": "0.002", "docsperex":...
```
Similar to how the refactor was done for the non-model-parallel version, we should split https://github.com/facebookresearch/metaseq/blob/main/metaseq/model_parallel/modules/transformer_layer.py into two files (encoder vs decoder) before trying to unify the codepaths between model-parallel vs non-model-parallel...
We have a good number of tests floating around in https://github.com/facebookresearch/metaseq/tree/main/tests, but these never run through CircleCI / on each PR. We should move as many of these back into...
There are lots of defaults set for all the `transformer_*` arches - we should remove as many of these as possible and noisily fail if users do not specify these...
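The "noisily fail" behavior could look something like the sketch below; the helper name and the specific flag names are hypothetical, just to illustrate validating args up front instead of silently falling back to a `transformer_*` default:

```python
def require_arch_args(
    args,
    required=("decoder_layers", "decoder_embed_dim", "decoder_attention_heads"),
):
    """Fail loudly if required architecture flags are unset.

    Hypothetical helper: rather than silently applying an arch default,
    collect every missing flag and raise a single descriptive error.
    """
    missing = [name for name in required if getattr(args, name, None) is None]
    if missing:
        raise ValueError(
            f"Missing required architecture flags: {', '.join(missing)}. "
            "Defaults have been removed; please set these explicitly."
        )
```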
As @glample notes: > kwargs appears 260 times in the codebase ! This makes code hard to read / follow if we just throw kwargs around like candy. This is...
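A small before/after sketch of why this hurts readability (function and parameter names here are made up for illustration): with `**kwargs`, the real interface is invisible at the call site and a typo'd keyword is silently swallowed, whereas explicit keyword-only parameters document the signature and raise immediately:

```python
# Before: the caller cannot tell what build_layer actually accepts,
# and a typo'd keyword name is silently ignored.
def build_layer_opaque(dim, **kwargs):
    dropout = kwargs.get("dropout", 0.0)
    num_heads = kwargs.get("num_heads", 8)
    return {"dim": dim, "dropout": dropout, "num_heads": num_heads}

# After: explicit keyword-only parameters document the interface,
# and unexpected arguments raise a TypeError at the call site.
def build_layer_explicit(dim, *, dropout=0.0, num_heads=8):
    return {"dim": dim, "dropout": dropout, "num_heads": num_heads}
```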
In other words, if https://github.com/facebookresearch/metaseq/blob/133d32c04975f94318b6281b3618c9aebf6e8100/metaseq/launcher/opt_job_constants.py#L18 is set to 1, make sure all flags are consistent. Initialization should match too (it currently might not).