Susan Zhang

45 issues opened by Susan Zhang

Noticed this comment from many moons ago: https://github.com/facebookresearch/metaseq/blob/e174f4e2b97d2d65ee8f39e80e8f60d16e4db67d/metaseq/model_parallel/modules/multihead_attention.py#L528-L532. Do some spring cleaning around this logic, along with the `need_attn`, `need_weights`, and `need_head_weights` flags, which don't seem to do much meaningful gating...

enhancement
better-eng

There are currently two logbooks checked into the OPT project directory (https://github.com/facebookresearch/metaseq/tree/main/projects/OPT/chronicles), which were placed there for easy access on release. To avoid accumulating too many of these going forward,...

It currently seems like OPT-350m may not have been trained with layer norms, as raised in https://github.com/facebookresearch/metaseq/issues/383 and suggested by the diff in https://github.com/facebookresearch/metaseq/commit/c4b33ba6e2cd9b33539bbb5a35d831096bde3282, which shows that `decoder_normalize_before` was defaulted to `True` only in...

bug
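For context, a minimal sketch of what a `decoder_normalize_before`-style flag typically gates in a transformer block (pre-norm vs. post-norm placement of the layer norm); this is illustrative only, not metaseq's actual decoder layer:

```python
import torch
import torch.nn as nn

class SublayerWithNorm(nn.Module):
    """Illustrative only: what a `normalize_before` flag usually controls."""

    def __init__(self, dim: int, normalize_before: bool):
        super().__init__()
        self.normalize_before = normalize_before
        self.layer_norm = nn.LayerNorm(dim)
        self.sublayer = nn.Linear(dim, dim)  # stand-in for attention / FFN

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.normalize_before:
            # pre-norm: LayerNorm -> sublayer -> residual add
            return x + self.sublayer(self.layer_norm(x))
        # post-norm: sublayer -> residual add -> LayerNorm
        return self.layer_norm(x + self.sublayer(x))
```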

As https://github.com/facebookresearch/metaseq/issues/383 flagged, we currently have two separate codepaths for model-parallel vs. non-model-parallel code (e.g. https://github.com/facebookresearch/metaseq/blob/main/metaseq/model_parallel/models/transformer_lm.py vs. https://github.com/facebookresearch/metaseq/blob/main/metaseq/models/transformer_lm.py). This should all be unified.

enhancement
better-eng

We are seeing cases where jobs will continue to stay alive after a series of errors like:
```
raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank:...
```

bug
enhancement
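One possible mitigation sketch (not necessarily the fix metaseq should adopt; the helper name and timeout value are assumptions): wrap the distributed init so a rank hard-exits instead of lingering after the store-based barrier times out.

```python
import datetime
import os
import sys

import torch.distributed as dist

def init_process_group_or_die(backend: str = "nccl", timeout_minutes: int = 30) -> None:
    """Initialize torch.distributed, but hard-exit on a store-based barrier
    timeout so the job does not linger as a zombie."""
    try:
        dist.init_process_group(
            backend=backend,
            timeout=datetime.timedelta(minutes=timeout_minutes),
        )
    except RuntimeError as err:
        # e.g. "Timed out initializing process group in store based barrier on rank: ..."
        print(f"rank {os.environ.get('RANK', '?')}: {err}", file=sys.stderr, flush=True)
        os._exit(1)  # bypass any non-daemon threads keeping the process alive
```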

See: https://github.com/facebookresearch/metaseq/blob/9afea52f5988fcbfe6133591fc27dd56044bd4ea/metaseq/checkpoint_utils.py#L424-L425. A previous attempt at removing this hack broke generation, evals, and resuming training from checkpoint (namespaces were slipping through despite conversion to omegaconf; need to track down where...

enhancement
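For reference, a minimal sketch of the kind of conversion involved (the helper is hypothetical, not the function metaseq uses): recursively turning `argparse.Namespace` objects into `DictConfig` so no raw namespaces slip through into checkpoint state.

```python
from argparse import Namespace

from omegaconf import DictConfig, OmegaConf

def namespaces_to_omegaconf(cfg):
    """Recursively convert (possibly nested) Namespace objects to DictConfig.
    Only primitive/container values are handled; anything else is left as-is."""
    if isinstance(cfg, Namespace):
        return OmegaConf.create({k: namespaces_to_omegaconf(v) for k, v in vars(cfg).items()})
    if isinstance(cfg, dict):
        return OmegaConf.create({k: namespaces_to_omegaconf(v) for k, v in cfg.items()})
    return cfg

ns = Namespace(model=Namespace(decoder_layers=24), lr=3e-4)
cfg = namespaces_to_omegaconf(ns)
assert isinstance(cfg, DictConfig) and isinstance(cfg.model, DictConfig)
```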

One example: in GPT-2, positional embeddings are initialized with half the standard deviation of token embeddings (https://github.com/openai/gpt-2/blob/a74da5d99abaaba920de8131d64da2862a8f213b/src/model.py#L152-L155). We have also seen the 175B model learn this distinction,...

enhancement
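For reference, a minimal PyTorch sketch of the GPT-2 convention linked above (shapes are placeholders): positional embeddings get half the standard deviation used for token embeddings.

```python
import torch.nn as nn

vocab_size, max_positions, embed_dim = 50257, 2048, 1024  # placeholder shapes

token_embed = nn.Embedding(vocab_size, embed_dim)
pos_embed = nn.Embedding(max_positions, embed_dim)

# GPT-2's src/model.py uses stddev 0.02 for token embeddings (wte)
# and 0.01 for positional embeddings (wpe), i.e. half the std.
nn.init.normal_(token_embed.weight, mean=0.0, std=0.02)
nn.init.normal_(pos_embed.weight, mean=0.0, std=0.01)
```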

We have RNG seed offsets sprinkled through the codebase:
```
(base) √ metaseq % ag seed --py | grep +
cpu_tests/test_streaming_token_block_dataset.py:78: shadow_rng = np.random.default_rng(2273 + seed)
cpu_tests/test_streaming_token_block_dataset.py:124: shadow_rng = np.random.default_rng(2273...
```

enhancement
good first issue
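One possible cleanup sketch (function and component names are hypothetical): derive per-component generators from the global seed with `numpy.random.SeedSequence` instead of scattering magic offsets like `2273 + seed`.

```python
import numpy as np

def component_rng(seed: int, component: str) -> np.random.Generator:
    """Return an RNG whose stream depends on both the global seed and a
    component name, so callers don't need hand-picked offsets."""
    # Mix the component name (as bytes -> ints) into the seed entropy.
    entropy = [seed, *component.encode("utf-8")]
    return np.random.default_rng(np.random.SeedSequence(entropy))

dataset_rng = component_rng(1, "streaming_token_block_dataset")
dataloader_rng = component_rng(1, "dataloader")  # independent stream, same global seed
```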

Turn this script:
```
export SUSPECTS="nodelist[1024-1035,1037-1045,1049-1068,1079-1101,1109-1121,1123-1129,1131-1151,1153-1156,1172-1179,1200-1202,1204-1211,1213-1217,1219-1226,1233-1237,1241-1252,1254-1275,1291-1298,1301-1322,1324-1326,1335-1337,1339-1341,1345-1347,1349-1355,1357-1373,1375-1388,1391-1420,1423-1432,1435-1437,1443-1450,1452-1457,1459-1462,1466-1469,1475-1483,1485-1489,1493-1515,1527-1530,1538-1544,1546-1559,1574-1579,1581-1586,1591-1596,1610-1613,1617-1626,1638-1652,1654-1660,1663-1670,1672,1675-1690,1694,1696,1698-1704,1706-1709]"
export NUM_CHUNKS=4
export CHECKPOINT_DIR=
scontrol show hostnames $SUSPECTS > machine_list.txt
awk -v chunks="$NUM_CHUNKS" 'NR%chunks{printf "%s,",$0;next;}1' machine_list.txt | xargs -n1 scontrol show hostlist > chunked_machine_list.txt
...
```

enhancement
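A minimal sketch of what a parameterized version might look like (argument names and output layout are assumptions), wrapping the same `scontrol` calls as the shell pipeline above:

```python
import argparse
import subprocess

def chunk_hostlist(suspects: str, num_chunks: int) -> list[str]:
    """Expand a Slurm hostlist and regroup it into chunks of `num_chunks`
    hosts, mirroring the scontrol/awk/xargs pipeline."""
    hosts = subprocess.run(
        ["scontrol", "show", "hostnames", suspects],
        check=True, capture_output=True, text=True,
    ).stdout.split()
    groups = [hosts[i : i + num_chunks] for i in range(0, len(hosts), num_chunks)]
    return [
        subprocess.run(
            ["scontrol", "show", "hostlist", ",".join(group)],
            check=True, capture_output=True, text=True,
        ).stdout.strip()
        for group in groups
    ]

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--suspects", required=True)
    parser.add_argument("--num-chunks", type=int, default=4)
    args = parser.parse_args()
    print("\n".join(chunk_hostlist(args.suspects, args.num_chunks)))
```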

We currently have the following unfortunate naming: https://github.com/facebookresearch/metaseq/blob/4288451502667dda2be71a0a1a9df5066b583ae8/metaseq/tasks/streaming_language_modeling.py#L271-L290 where our training corpus is chunked up into shards, but each shard gets referenced as an epoch. We should fix this confusing...

enhancement