`convert_to_singleton` seems to hang for OPT-66B
What is your question?
With the directory prepared as follows:
$ ls 66b/
dict.txt reshard-model_part-0-shard0.pt reshard-model_part-3-shard0.pt reshard-model_part-6-shard0.pt
gpt2-merges.txt reshard-model_part-1-shard0.pt reshard-model_part-4-shard0.pt reshard-model_part-7-shard0.pt
gpt2-vocab.json reshard-model_part-2-shard0.pt reshard-model_part-5-shard0.pt
I had to hack checkpoint_utils.py a bit, since this assumption isn't true for OPT-66B:
https://github.com/facebookresearch/metaseq/blob/ac8659de23b680005a14490d72a874613ab59381/metaseq/checkpoint_utils.py#L390-L391
I replaced those lines with the following:
# path to checkpoint...-shard0.pt
local_path = local_path.split('.')[0] + '-shard0.pt'
paths_to_load = get_paths_to_load(local_path, suffix="shard")
Running the following
NCCL_SHM_DISABLE=1 NCCL_DEBUG=INFO python -m metaseq.scripts.convert_to_singleton 66b/
is taking a long time (22 hours and counting). Initially nvidia-smi showed all eight GPUs busy, but then the process on GPU 5 terminated first, and things have been in the following state for hours:
$ nvidia-smi
Thu Oct 13 19:24:37 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:16.0 Off | 0 |
| N/A 54C P0 74W / 300W | 20049MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:00:17.0 Off | 0 |
| N/A 53C P0 72W / 300W | 20133MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:00:18.0 Off | 0 |
| N/A 52C P0 73W / 300W | 19845MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:00:19.0 Off | 0 |
| N/A 50C P0 70W / 300W | 19857MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... Off | 00000000:00:1A.0 Off | 0 |
| N/A 54C P0 76W / 300W | 20073MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... Off | 00000000:00:1B.0 Off | 0 |
| N/A 47C P0 44W / 300W | 1413MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... Off | 00000000:00:1C.0 Off | 0 |
| N/A 50C P0 72W / 300W | 19977MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... Off | 00000000:00:1D.0 Off | 0 |
| N/A 54C P0 69W / 300W | 19905MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1335 C python 19788MiB |
| 1 N/A N/A 1419 C ...onda/envs/user/bin/python 19872MiB |
| 2 N/A N/A 1420 C ...onda/envs/user/bin/python 19584MiB |
| 3 N/A N/A 1421 C ...onda/envs/user/bin/python 19596MiB |
| 4 N/A N/A 1422 C ...onda/envs/user/bin/python 19812MiB |
| 6 N/A N/A 1424 C ...onda/envs/user/bin/python 19716MiB |
| 7 N/A N/A 1425 C ...onda/envs/user/bin/python 19644MiB |
+-----------------------------------------------------------------------------+
Is there something obviously wrong here, or something I should try instead? Just in case it really does take this long, I'm leaving it running. The last few INFO-level logging lines look like this:
(...)
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO Channel 14 : 4[1a0] -> 2[180] via P2P/indirect/6[1c0]
i-0b2d24dbd20c27dd0:1423:3387 [5] NCCL INFO Channel 14 : 5[1b0] -> 3[190] via P2P/indirect/1[170]
i-0b2d24dbd20c27dd0:1419:3383 [1] NCCL INFO Channel 14 : 1[170] -> 7[1d0] via P2P/indirect/3[190]
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO Channel 07 : 4[1a0] -> 3[190] via P2P/indirect/0[160]
i-0b2d24dbd20c27dd0:1335:3382 [0] NCCL INFO Channel 07 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0]
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO Channel 15 : 4[1a0] -> 3[190] via P2P/indirect/0[160]
i-0b2d24dbd20c27dd0:1335:3382 [0] NCCL INFO Channel 15 : 0[160] -> 7[1d0] via P2P/indirect/4[1a0]
i-0b2d24dbd20c27dd0:1419:3383 [1] NCCL INFO comm 0x7f5f78003090 rank 1 nranks 8 cudaDev 1 busId 170 - Init COMPLETE
i-0b2d24dbd20c27dd0:1420:3386 [2] NCCL INFO comm 0x7f7408003090 rank 2 nranks 8 cudaDev 2 busId 180 - Init COMPLETE
i-0b2d24dbd20c27dd0:1422:3388 [4] NCCL INFO comm 0x7fdfc8003090 rank 4 nranks 8 cudaDev 4 busId 1a0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1335:3382 [0] NCCL INFO comm 0x7f5b60003090 rank 0 nranks 8 cudaDev 0 busId 160 - Init COMPLETE
i-0b2d24dbd20c27dd0:1424:3384 [6] NCCL INFO comm 0x7fd82c003090 rank 6 nranks 8 cudaDev 6 busId 1c0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1423:3387 [5] NCCL INFO comm 0x7fd544003090 rank 5 nranks 8 cudaDev 5 busId 1b0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1421:3389 [3] NCCL INFO comm 0x7f9c64003090 rank 3 nranks 8 cudaDev 3 busId 190 - Init COMPLETE
i-0b2d24dbd20c27dd0:1425:3385 [7] NCCL INFO comm 0x7f3fe0003090 rank 7 nranks 8 cudaDev 7 busId 1d0 - Init COMPLETE
i-0b2d24dbd20c27dd0:1335:1335 [0] NCCL INFO Launch mode Parallel
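One thing I could try to narrow this down: dump the Python stacks of the surviving processes to see whether they are all blocked inside a torch.distributed collective. A minimal sketch, assuming py-spy is installed, using the PIDs from the nvidia-smi output above:

import subprocess

# Dump the Python stack of each surviving convert_to_singleton worker;
# the PIDs are taken from the nvidia-smi "Processes" table above.
pids = [1335, 1419, 1420, 1421, 1422, 1424, 1425]
for pid in pids:
    subprocess.run(["py-spy", "dump", "--pid", str(pid)], check=False)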
What's your environment?
- metaseq Version: 7828d72815a9a581ab47b95876d38cb262741883 (Oct 5 main)
- PyTorch Version: 1.12.1+cu113
- OS: Ubuntu 18.04.6 LTS
- How you installed metaseq: pip
- Build command you used (if compiling from source): N/A
- Python version: 3.10
- CUDA/cuDNN version: CUDA 11.8
- GPU models and configuration: 8 x V100 SXM2 32 GB
With the same setup on another (identical) instance, convert_to_singleton seems to be hanging in a similar state, except that now it's the process on GPU 7 that finished first:
$ nvidia-smi
Fri Oct 14 01:58:48 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:00:16.0 Off | Off |
| N/A 46C P0 72W / 300W | 19791MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:00:17.0 Off | Off |
| N/A 44C P0 69W / 300W | 19875MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:00:18.0 Off | Off |
| N/A 46C P0 70W / 300W | 19587MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:00:19.0 Off | Off |
| N/A 43C P0 67W / 300W | 19599MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... Off | 00000000:00:1A.0 Off | Off |
| N/A 45C P0 68W / 300W | 19815MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... Off | 00000000:00:1B.0 Off | Off |
| N/A 45C P0 69W / 300W | 19863MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... Off | 00000000:00:1C.0 Off | Off |
| N/A 42C P0 69W / 300W | 19719MiB / 32768MiB | 100% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... Off | 00000000:00:1D.0 Off | Off |
| N/A 38C P0 45W / 300W | 939MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 94277 C python 19788MiB |
| 1 N/A N/A 94344 C ...onda/envs/user/bin/python 19872MiB |
| 2 N/A N/A 94345 C ...onda/envs/user/bin/python 19584MiB |
| 3 N/A N/A 94346 C ...onda/envs/user/bin/python 19596MiB |
| 4 N/A N/A 94347 C ...onda/envs/user/bin/python 19812MiB |
| 5 N/A N/A 94348 C ...onda/envs/user/bin/python 19860MiB |
| 6 N/A N/A 94349 C ...onda/envs/user/bin/python 19716MiB |
+-----------------------------------------------------------------------------+
@punitkoura is actually working on this
Isn't this similar to what Binh is working on? Creating a consolidated sharding script https://github.com/facebookresearch/metaseq/issues/376
I don't know exactly what @punitkoura is working on, but the fact that
- I can load OPT-2.7B within reasonable time now as long as the world size matches (i.e. on a 4-GPU instance)
- OPT-66B is the only publicly-released model with files named reshard-model_part-$i-shard0.pt and therefore requires the hack above
makes me think that this is a separate issue that only affects OPT-66B.
Punit has a patch I believe.
Hi @punitkoura, could you share a bit about the root cause and the current status? I would be more than happy to help if there is something I can do!
@EIFY Ahh sorry for the delay! Could you have a look at this patch which saves memory when trying to consolidate different model parts into a single checkpoint? https://github.com/facebookresearch/metaseq/pull/430
I haven't merged this since the patch requires PyTorch 1.12 at least.
For context, the process you mentioned is probably terminating early because of running out of memory. The patch I tagged in my previous comment does a gather only on the first process instead of consolidating everything on all processes (and thereby creating 8 copies of the model).
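To illustrate the idea, here is a minimal sketch of the gather-on-rank-0 pattern (not the actual code in #430; it assumes a PyTorch version whose backend supports dist.gather, which may be why the patch needs at least 1.12):

import torch
import torch.distributed as dist

def consolidate_shard(shard: torch.Tensor):
    # Gather one flattened parameter shard onto rank 0 only, so that a single
    # full copy exists instead of one copy per rank.
    world_size = dist.get_world_size()
    buffers = (
        [torch.empty_like(shard) for _ in range(world_size)]
        if dist.get_rank() == 0
        else None
    )
    dist.gather(shard, gather_list=buffers, dst=0)
    if dist.get_rank() == 0:
        # Stitch on CPU so the full tensor never has to fit on one GPU.
        return torch.cat([b.cpu() for b in buffers], dim=0)
    return None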
@punitkoura I happen to be running PyTorch 1.12 so I will give it a try, but questions:
- Was convert_to_singleton hanging because one of the processes died of OOM? Ideally we want it to terminate if it's no longer possible to complete.
- With #430, does it still require GPU 0 to hold a complete copy of the model? I don't think 32 GB will be enough for OPT-66B...
And sorry about the checkpoint naming confusion... The checkpoints should ideally be named like the other checkpoints, i.e.
reshard-model_part-0.pt, reshard-model_part-1.pt, etc.
instead of having that shard0 suffix. Renaming them would let you load the checkpoint without having to patch checkpoint_utils.py.
I'll work on getting the names fixed in the meantime.
@punitkoura I happen to be running PyTorch 1.12 so I will give it a try, but questions:
- Was convert_to_singleton hanging because one of the processes died of OOM? Ideally we want it to terminate if it's no longer possible to complete.
Yes, that is my hypothesis. I observed this when trying to consolidate other larger models as well. I agree, we should detect this condition and terminate instead of hanging.
- With Convert to singleton.py - Gathering parameters only on rank 0 to save memory for large models (#430), does it still require GPU 0 to hold a complete copy of the model? I don't think 32 GB will be enough for OPT-66B...
We won't be using GPU 0 to store the whole model. We stitch all parameters on CPU, so we won't need extra GPU memory from what I've observed. (See the .cpu() call in convert_to_singleton). As long as you have enough CPU memory you should be fine. But let me know if you still face issues.
TL;DR: the current convert_to_singleton wastes CPU memory by creating multiple copies of the model, which we try to fix with the patch in https://github.com/facebookresearch/metaseq/pull/430
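On the hang itself: one way to make such a run fail fast instead of blocking forever would be to give the process group an explicit timeout. A rough sketch, not what convert_to_singleton currently does (for NCCL the timeout is only enforced when NCCL_ASYNC_ERROR_HANDLING=1 is set):

import datetime
import os

import torch.distributed as dist

# With a timeout set, a collective whose peer process has died raises an error
# after 30 minutes instead of hanging indefinitely.
os.environ.setdefault("NCCL_ASYNC_ERROR_HANDLING", "1")
dist.init_process_group(
    backend="nccl",
    init_method="tcp://localhost:13000",  # hypothetical single-node rendezvous
    world_size=int(os.environ.get("WORLD_SIZE", "8")),
    rank=int(os.environ.get("RANK", "0")),
    timeout=datetime.timedelta(minutes=30),
)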
Using #430 convert_to_singleton completed successfully after writing restored.pt 🎉
However, metaseq-api-local failed to load from it as it tries to put the whole model on GPU 0:
$ metaseq-api-local
2022-10-26 21:15:46 | INFO | metaseq.hub_utils | loading model(s) from /home/jason_chou/redspot_home/66b/restored.pt
2022-10-26 21:30:11 | INFO | metaseq.checkpoint_utils | Done reading from disk
Traceback (most recent call last):
File "/home/jason_chou/.conda/envs/user/bin/metaseq-api-local", line 8, in <module>
sys.exit(cli_main())
File "/home/default_user/metaseq/metaseq_cli/interactive_hosted.py", line 370, in cli_main
distributed_utils.call_main(cfg, worker_main, namespace_args=args)
File "/home/default_user/metaseq/metaseq/distributed/utils.py", line 279, in call_main
return main(cfg, **kwargs)
File "/home/default_user/metaseq/metaseq_cli/interactive_hosted.py", line 176, in worker_main
models = generator.load_model() # noqa: F841
File "/home/default_user/metaseq/metaseq/hub_utils.py", line 579, in load_model
models, _model_args, _task = _load_checkpoint()
File "/home/default_user/metaseq/metaseq/hub_utils.py", line 562, in _load_checkpoint
return checkpoint_utils.load_model_ensemble_and_task(
File "/home/default_user/metaseq/metaseq/checkpoint_utils.py", line 488, in load_model_ensemble_and_task
model = build_model_hook(cfg, task)
File "/home/default_user/metaseq/metaseq/hub_utils.py", line 553, in _build_model
model = task.build_model(cfg.model).cuda()
File "/home/default_user/metaseq/metaseq/tasks/base_task.py", line 531, in build_model
model = models.build_model(args, self)
File "/home/default_user/metaseq/metaseq/models/__init__.py", line 87, in build_model
return model.build_model(cfg, task)
File "/home/default_user/metaseq/metaseq/models/transformer_lm.py", line 185, in build_model
decoder = TransformerDecoder(
File "/home/default_user/metaseq/metaseq/models/transformer_decoder.py", line 127, in __init__
layers.append(self.build_decoder_layer(args))
File "/home/default_user/metaseq/metaseq/models/transformer_decoder.py", line 253, in build_decoder_layer
layer = self.build_base_decoder_layer(args)
File "/home/default_user/metaseq/metaseq/models/transformer_decoder.py", line 250, in build_base_decoder_layer
return TransformerDecoderLayer(args)
File "/home/default_user/metaseq/metaseq/modules/transformer_decoder_layer.py", line 94, in __init__
self.fc2 = self.build_fc2(
File "/home/default_user/metaseq/metaseq/modules/transformer_decoder_layer.py", line 134, in build_fc2
return Linear(
File "/home/default_user/metaseq/metaseq/modules/linear.py", line 41, in __init__
torch.empty(out_features, in_features, device=device, dtype=dtype)
RuntimeError: CUDA out of memory. Tried to allocate 648.00 MiB (GPU 0; 31.75 GiB total capacity; 30.64 GiB already allocated; 270.94 MiB free; 30.65 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I thought metaseq would shard the model automatically here. Are there configs / env variables that can make it do so?
I have tried the latest main just in case and it didn't help. @punitkoura
@EIFY Metaseq won't shard the model automatically here... Could you let me know the end result you're trying to achieve? You could have just loaded the model parallel model to be used with metaseq-api-local (without consolidating)
@punitkoura I just want to load the model and run inference (i.e. sentence completion). What should I put as MODEL_FILE in constants.py in order to load the model parallel model?
try:
# internal logic denoting where checkpoints are in meta infrastructure
from metaseq_internal.constants import CHECKPOINT_FOLDER
except ImportError:
# CHECKPOINT_FOLDER should point to a shared drive (e.g. NFS) where the
# checkpoints from S3 are stored. As an example:
# CHECKPOINT_FOLDER = "/example/175B/reshard_no_os"
# $ ls /example/175B/reshard_no_os
# reshard-model_part-0.pt
# reshard-model_part-1.pt
# reshard-model_part-2.pt
# reshard-model_part-3.pt
# reshard-model_part-4.pt
# reshard-model_part-5.pt
# reshard-model_part-6.pt
# reshard-model_part-7.pt
CHECKPOINT_FOLDER = "/home/jason_chou/redspot_home/66b/"
# tokenizer files
BPE_MERGES = os.path.join(CHECKPOINT_FOLDER, "gpt2-merges.txt")
BPE_VOCAB = os.path.join(CHECKPOINT_FOLDER, "gpt2-vocab.json")
MODEL_FILE = os.path.join(CHECKPOINT_FOLDER, "restored.pt")
@EIFY You can see this README on how to override constants.py https://github.com/facebookresearch/metaseq/blob/main/metaseq/cli/README.md
You can actually just follow my_first_override.py
Change the checkpoint location. If the checkpoint is at
/path/to/checkpoint/reshard-model_part-{mp}.pt
where mp goes from 0 to 7, you would point the model file at
MODEL_FILE = "/path/to/checkpoint/reshard.pt"
(the -model_part-{mp} suffix is appended at load time). MODEL_PARALLEL will be 8.
Keep "--ddp-backend fully_sharded", since the checkpoint is FSDP-wrapped.
Let me know if you face issues here
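In other words, something like this in your constants.py override (a condensed sketch with a hypothetical path; the tokenizer and batching settings stay as they are):

import os

# The folder contains reshard-model_part-0.pt ... reshard-model_part-7.pt.
CHECKPOINT_FOLDER = "/path/to/checkpoint"
# MODEL_FILE points at a "reshard.pt" that doesn't exist on disk; the
# -model_part-{mp} suffix is appended at load time.
MODEL_FILE = os.path.join(CHECKPOINT_FOLDER, "reshard.pt")
MODEL_PARALLEL = 8
TOTAL_WORLD_SIZE = 8
# LAUNCH_ARGS keeps "--ddp-backend fully_sharded", since the checkpoint is FSDP-wrapped.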
@punitkoura I think this is what you meant but it didn't work:
Firstly, I manually renamed the files
for i in {0..7}
do
mv reshard-model_part-$i-shard0.pt reshard-model_part-$i.pt
done
such that
$ ls /home/jason_chou/redspot_home/66b
dict.txt gpt2-vocab.json reshard-model_part-1.pt reshard-model_part-3.pt reshard-model_part-5.pt reshard-model_part-7.pt
gpt2-merges.txt reshard-model_part-0.pt reshard-model_part-2.pt reshard-model_part-4.pt reshard-model_part-6.pt restored.pt
Secondly, I edited constants.py accordingly:
$ cat metaseq/metaseq/service/constants.py
# Copyright (c) Meta Platforms, Inc. and affiliates. All Rights Reserved.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
import os
MAX_SEQ_LEN = 2048
BATCH_SIZE = 2048 # silly high bc we dynamically batch by MAX_BATCH_TOKENS
MAX_BATCH_TOKENS = 3072
DEFAULT_PORT = 6010
MODEL_PARALLEL = 8
TOTAL_WORLD_SIZE = 8
MAX_BEAM = 16
try:
# internal logic denoting where checkpoints are in meta infrastructure
from metaseq_internal.constants import CHECKPOINT_FOLDER
except ImportError:
# CHECKPOINT_FOLDER should point to a shared drive (e.g. NFS) where the
# checkpoints from S3 are stored. As an example:
# CHECKPOINT_FOLDER = "/example/175B/reshard_no_os"
# $ ls /example/175B/reshard_no_os
# reshard-model_part-0.pt
# reshard-model_part-1.pt
# reshard-model_part-2.pt
# reshard-model_part-3.pt
# reshard-model_part-4.pt
# reshard-model_part-5.pt
# reshard-model_part-6.pt
# reshard-model_part-7.pt
CHECKPOINT_FOLDER = "/home/jason_chou/redspot_home/66b/"
# tokenizer files
BPE_MERGES = os.path.join(CHECKPOINT_FOLDER, "gpt2-merges.txt")
BPE_VOCAB = os.path.join(CHECKPOINT_FOLDER, "gpt2-vocab.json")
MODEL_FILE = os.path.join(CHECKPOINT_FOLDER, "reshard.pt")
LAUNCH_ARGS = [
f"--model-parallel-size {MODEL_PARALLEL}",
f"--distributed-world-size {TOTAL_WORLD_SIZE}",
"--ddp-backend fully_sharded",
"--task language_modeling",
f"--bpe-merges {BPE_MERGES}",
f"--bpe-vocab {BPE_VOCAB}",
"--bpe hf_byte_bpe",
f"--merges-filename {BPE_MERGES}", # TODO(susanz): hack for getting interactive_hosted working on public repo
f"--vocab-filename {BPE_VOCAB}", # TODO(susanz): hack for getting interactive_hosted working on public repo
f"--path {MODEL_FILE}",
"--beam 1 --nbest 1",
"--distributed-port 13000",
"--checkpoint-shard-count 1",
f"--batch-size {BATCH_SIZE}",
f"--buffer-size {BATCH_SIZE * MAX_SEQ_LEN}",
f"--max-tokens {BATCH_SIZE * MAX_SEQ_LEN}",
"/tmp", # required "data" argument.
]
However
$ metaseq-api-local
2022-10-26 23:45:43 | INFO | metaseq.hub_utils | loading model(s) from /home/jason_chou/redspot_home/66b/reshard.pt
Traceback (most recent call last):
File "/home/jason_chou/.conda/envs/user/bin/metaseq-api-local", line 8, in <module>
sys.exit(cli_main())
File "/home/jason_chou/metaseq/metaseq/cli/interactive_hosted.py", line 380, in cli_main
distributed_utils.call_main(cfg, worker_main, namespace_args=args)
File "/home/jason_chou/metaseq/metaseq/distributed/utils.py", line 279, in call_main
return main(cfg, **kwargs)
File "/home/jason_chou/metaseq/metaseq/cli/interactive_hosted.py", line 186, in worker_main
models = generator.load_model() # noqa: F841
File "/home/jason_chou/metaseq/metaseq/hub_utils.py", line 147, in load_model
models, _model_args, _task = _load_checkpoint()
File "/home/jason_chou/metaseq/metaseq/hub_utils.py", line 132, in _load_checkpoint
return checkpoint_utils.load_model_ensemble_and_task(
File "/home/jason_chou/metaseq/metaseq/checkpoint_utils.py", line 457, in load_model_ensemble_and_task
state = load_checkpoint_to_cpu(filename, arg_overrides)
File "/home/jason_chou/metaseq/metaseq/checkpoint_utils.py", line 392, in load_checkpoint_to_cpu
paths_to_load = get_paths_to_load(local_path, suffix="shard")
File "/home/jason_chou/metaseq/metaseq/checkpoint_utils.py", line 332, in get_paths_to_load
if not _is_checkpoint_sharded(checkpoint_files):
File "/home/jason_chou/metaseq/metaseq/checkpoint_utils.py", line 319, in _is_checkpoint_sharded
raise FileNotFoundError(
FileNotFoundError: We weren't able to find any checkpoints corresponding to the parameters you set. This could mean you have a typo, or it could mean you have a mismatch in distributed training parameters, especially --fsdp or--model-parallel. If you are working on a new script, it may also mean you failed to fsdp_wrap or you have an unnecessary fsdp_wrap.
I might have tried this or something similar before with other sizes.
@EIFY sorry about that. Let me replicate your steps and add some print statements in a separate branch to figure out the root cause. I'll update this issue in a bit.
@EIFY I made a branch here with a couple of print statements and the config I used https://github.com/facebookresearch/metaseq/tree/punitkoura/debug-407
One missing flag is --use-sharded-state which I added, but your error comes before that.
This is the output I get
$ python interactive_hosted.py
> initializing tensor model parallel with size 8
> initializing pipeline model parallel with size 1
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2719 and data parallel seed: 1
2022-10-27 01:10:05 | INFO | metaseq.hub_utils | loading model(s) from /data/66B/reshard_no_os/reshard.pt
In load_model_ensemble_and_task filenames = ['/data/66B/reshard_no_os/reshard.pt'] arg_overrides = {} suffix = -model_part-0
Inside load_checkpoint_to_cpu path = /data/66B/reshard_no_os/reshard-model_part-0.pt arg_overrides = {}
Inside get_paths_to_load local_path = /data/66B/reshard_no_os/reshard-model_part-0.pt suffix = shard checkpoint_files = ['/data/66B/reshard_no_os/reshard-model_part-0.pt']
2022-10-27 01:12:12 | INFO | metaseq.checkpoint_utils | Done reading from disk
2022-10-27 01:12:34 | INFO | metaseq.checkpoint_utils | Done loading state dict
2022-10-27 01:12:41 | INFO | metaseq.cli.interactive | loaded model 0
2022-10-27 01:13:10 | INFO | metaseq.cli.interactive | Worker engaged! 10.37.65.78:6010
* Serving Flask app 'interactive_hosted' (lazy loading)
I ran the interactive_hosted command, but interactive_cli hits the same model loading logic.
@EIFY Could you use this branch and paste the output you get?
It seems to me that distributed process groups weren't initialized properly. In addition to Punit's suggestion, can you also quickly check if Slurm environment variables have been inherited correctly (e.g. simply run echo $SLURM_STEP_NODELIST)? This might be important because with the distributed port set, we often initialize distributed process groups by inferring Slurm, and the checkpoint suffix depends on whether the initialization finishes successfully.
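For example, a quick check (standard Slurm variable names; all unset would mean no Slurm environment was inherited):

import os

for var in ("SLURM_STEP_NODELIST", "SLURM_NODELIST", "SLURM_NNODES", "SLURM_PROCID"):
    print(var, "=", os.environ.get(var))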
@punitkoura running off origin/punitkoura/debug-407 (fbcf3e35b552126f0bfa8ef40f93b11614aaa2f8) with no change other than CHECKPOINT_FOLDER = "/home/jason_chou/redspot_home/66b/":
$ metaseq-api-local
2022-10-27 02:18:18 | INFO | metaseq.hub_utils | loading model(s) from /home/jason_chou/redspot_home/66b/reshard.pt
In load_model_ensemble_and_task filenames = ['/home/jason_chou/redspot_home/66b/reshard.pt'] arg_overrides = {} suffix =
Inside load_checkpoint_to_cpu path = /home/jason_chou/redspot_home/66b/reshard.pt arg_overrides = {}
Inside get_paths_to_load local_path = /home/jason_chou/redspot_home/66b/reshard.pt suffix = shard checkpoint_files = []
Traceback (most recent call last):
File "/home/jason_chou/.conda/envs/user/bin/metaseq-api-local", line 8, in <module>
sys.exit(cli_main())
File "/home/jason_chou/metaseq/metaseq/cli/interactive_hosted.py", line 380, in cli_main
distributed_utils.call_main(cfg, worker_main, namespace_args=args)
File "/home/jason_chou/metaseq/metaseq/distributed/utils.py", line 279, in call_main
return main(cfg, **kwargs)
File "/home/jason_chou/metaseq/metaseq/cli/interactive_hosted.py", line 186, in worker_main
models = generator.load_model() # noqa: F841
File "/home/jason_chou/metaseq/metaseq/hub_utils.py", line 147, in load_model
models, _model_args, _task = _load_checkpoint()
File "/home/jason_chou/metaseq/metaseq/hub_utils.py", line 132, in _load_checkpoint
return checkpoint_utils.load_model_ensemble_and_task(
File "/home/jason_chou/metaseq/metaseq/checkpoint_utils.py", line 464, in load_model_ensemble_and_task
state = load_checkpoint_to_cpu(filename, arg_overrides)
File "/home/jason_chou/metaseq/metaseq/checkpoint_utils.py", line 397, in load_checkpoint_to_cpu
paths_to_load = get_paths_to_load(local_path, suffix="shard")
File "/home/jason_chou/metaseq/metaseq/checkpoint_utils.py", line 335, in get_paths_to_load
if not _is_checkpoint_sharded(checkpoint_files):
File "/home/jason_chou/metaseq/metaseq/checkpoint_utils.py", line 321, in _is_checkpoint_sharded
raise FileNotFoundError(
FileNotFoundError: We weren't able to find any checkpoints corresponding to the parameters you set. This could mean you have a typo, or it could mean you have a mismatch in distributed training parameters, especially --fsdp or--model-parallel. If you are working on a new script, it may also mean you failed to fsdp_wrap or you have an unnecessary fsdp_wrap.
As for interactive_hosted.py:
~/metaseq/metaseq/cli$ python interactive_hosted.py
2022-10-27 02:24:33 | WARNING | metaseq.cli.interactive | Missing slurm configuration, defaulting to 'use entire node' for API
2022-10-27 02:24:34 | INFO | metaseq.hub_utils | loading model(s) from /home/jason_chou/redspot_home/66b/reshard.pt
In load_model_ensemble_and_task filenames = ['/home/jason_chou/redspot_home/66b/reshard.pt'] arg_overrides = {} suffix =
Inside load_checkpoint_to_cpu path = /home/jason_chou/redspot_home/66b/reshard.pt arg_overrides = {}
Inside get_paths_to_load local_path = /home/jason_chou/redspot_home/66b/reshard.pt suffix = shard checkpoint_files = []
Traceback (most recent call last):
File "/home/jason_chou/metaseq/metaseq/cli/interactive_hosted.py", line 394, in <module>
cli_main()
File "/home/jason_chou/metaseq/metaseq/cli/interactive_hosted.py", line 380, in cli_main
distributed_utils.call_main(cfg, worker_main, namespace_args=args)
File "/home/jason_chou/metaseq/metaseq/distributed/utils.py", line 279, in call_main
return main(cfg, **kwargs)
File "/home/jason_chou/metaseq/metaseq/cli/interactive_hosted.py", line 186, in worker_main
models = generator.load_model() # noqa: F841
File "/home/jason_chou/metaseq/metaseq/hub_utils.py", line 147, in load_model
models, _model_args, _task = _load_checkpoint()
File "/home/jason_chou/metaseq/metaseq/hub_utils.py", line 132, in _load_checkpoint
return checkpoint_utils.load_model_ensemble_and_task(
File "/home/jason_chou/metaseq/metaseq/checkpoint_utils.py", line 464, in load_model_ensemble_and_task
state = load_checkpoint_to_cpu(filename, arg_overrides)
File "/home/jason_chou/metaseq/metaseq/checkpoint_utils.py", line 397, in load_checkpoint_to_cpu
paths_to_load = get_paths_to_load(local_path, suffix="shard")
File "/home/jason_chou/metaseq/metaseq/checkpoint_utils.py", line 335, in get_paths_to_load
if not _is_checkpoint_sharded(checkpoint_files):
File "/home/jason_chou/metaseq/metaseq/checkpoint_utils.py", line 321, in _is_checkpoint_sharded
raise FileNotFoundError(
FileNotFoundError: We weren't able to find any checkpoints corresponding to the parameters you set. This could mean you have a typo, or it could mean you have a mismatch in distributed training parameters, especially --fsdp or--model-parallel. If you are working on a new script, it may also mean you failed to fsdp_wrap or you have an unnecessary fsdp_wrap.
@tangbinh I don't think I have Slurm installed but I don't think that's the issue. It seems that the suffix -model_part-0 is missing.
I don't think I have Slurm installed but I don't think that's the issue. It seems that the suffix -model_part-0 is missing.
You're right, Slurm isn't required and might not be relevant here. But indeed, I think the distributed process groups weren't initialized properly for some reason, which resulted in the missing checkpoint suffix (the suffix is set in distributed_init). You can see that Punit's log has lines such as initializing tensor model parallel with size 8, while yours doesn't.
Perhaps you can print out cfg.distributed_training.distributed_init_method here and try to figure out if distributed_main was invoked.
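Something as simple as this (a hypothetical print placed, e.g., at the top of worker_main) would tell us:

# Hypothetical debug print to confirm whether an init method was ever inferred.
print(
    "distributed_init_method =",
    cfg.distributed_training.distributed_init_method,
    "| distributed_port =",
    cfg.distributed_training.distributed_port,
)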
@tangbinh Yes, you're right, distributed init is not happening for @EIFY since slurm cannot be found. I think we can hack together the distributed config to fix this.
@EIFY could you pull the changes in the same branch (origin/punitkoura/debug-407) and run again to print your cfg.distributed_training config?
@EIFY I'm guessing distributed_init_method would be None for you. In that case, we can replace it with "tcp://localhost:13000". But let's get to that once we see your output.
@punitkoura At 8500e88 I got:
$ metaseq-api-local
cfg.distributed_training = {'_name': None, 'distributed_world_size': 8, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': None, 'distributed_port': 13000, 'device_id': 0, 'distributed_no_spawn': False, 'ddp_backend': 'fully_sharded', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'fast_stat_sync': False, 'broadcast_buffers': False, 'zero_sharding': 'none', 'fp16': False, 'memory_efficient_fp16': False, 'bf16': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': True, 'gradient_predivide_factor': None, 'distributed_num_procs': 8}
2022-10-27 04:53:30 | INFO | metaseq.hub_utils | loading model(s) from /home/jason_chou/redspot_home/66b/reshard.pt
In load_model_ensemble_and_task filenames = ['/home/jason_chou/redspot_home/66b/reshard.pt'] arg_overrides = {} suffix =
Inside load_checkpoint_to_cpu path = /home/jason_chou/redspot_home/66b/reshard.pt arg_overrides = {}
Inside get_paths_to_load local_path = /home/jason_chou/redspot_home/66b/reshard.pt suffix = shard checkpoint_files = []
Traceback (most recent call last):
File "/home/jason_chou/.conda/envs/user/bin/metaseq-api-local", line 8, in <module>
sys.exit(cli_main())
File "/home/jason_chou/metaseq/metaseq/cli/interactive_hosted.py", line 380, in cli_main
distributed_utils.call_main(cfg, worker_main, namespace_args=args)
File "/home/jason_chou/metaseq/metaseq/distributed/utils.py", line 282, in call_main
return main(cfg, **kwargs)
File "/home/jason_chou/metaseq/metaseq/cli/interactive_hosted.py", line 186, in worker_main
models = generator.load_model() # noqa: F841
File "/home/jason_chou/metaseq/metaseq/hub_utils.py", line 147, in load_model
models, _model_args, _task = _load_checkpoint()
File "/home/jason_chou/metaseq/metaseq/hub_utils.py", line 132, in _load_checkpoint
return checkpoint_utils.load_model_ensemble_and_task(
File "/home/jason_chou/metaseq/metaseq/checkpoint_utils.py", line 464, in load_model_ensemble_and_task
state = load_checkpoint_to_cpu(filename, arg_overrides)
File "/home/jason_chou/metaseq/metaseq/checkpoint_utils.py", line 397, in load_checkpoint_to_cpu
paths_to_load = get_paths_to_load(local_path, suffix="shard")
File "/home/jason_chou/metaseq/metaseq/checkpoint_utils.py", line 335, in get_paths_to_load
if not _is_checkpoint_sharded(checkpoint_files):
File "/home/jason_chou/metaseq/metaseq/checkpoint_utils.py", line 321, in _is_checkpoint_sharded
raise FileNotFoundError(
FileNotFoundError: We weren't able to find any checkpoints corresponding to the parameters you set. This could mean you have a typo, or it could mean you have a mismatch in distributed training parameters, especially --fsdp or--model-parallel. If you are working on a new script, it may also mean you failed to fsdp_wrap or you have an unnecessary fsdp_wrap.
Sounds like 'distributed_init_method': None is the issue: it should fall back to something instead of nothing...?
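Something along these lines, maybe (just a sketch of the kind of fallback I mean, not actual metaseq code):

def ensure_init_method(cfg):
    # If neither Slurm inference nor an explicit setting produced an init method,
    # default to a single-node TCP rendezvous on localhost.
    dt = cfg.distributed_training
    if dt.distributed_init_method is None and dt.distributed_world_size > 1:
        port = dt.distributed_port if dt.distributed_port > 0 else 29500  # 29500 = PyTorch default
        dt.distributed_init_method = f"tcp://localhost:{port}"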
@EIFY One last thing, could you try with 517d7addd514546371de1c5ebe6a0af4ce2fe120? I think this should work.
And if it works, I think I might know what's going wrong...
@punitkoura 517d7ad indeed works 🎉:
$ git checkout remotes/origin/punitkoura/debug-407
M metaseq/service/constants.py
Previous HEAD position was 8500e88 Add logging
HEAD is now at 517d7ad Add localhost
$
$ metaseq-api-local
cfg.distributed_training = {'_name': None, 'distributed_world_size': 8, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': 'tcp://localhost:13000', 'distributed_port': 13000, 'device_id': 0, 'distributed_no_spawn': False, 'ddp_backend': 'fully_sharded', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'fast_stat_sync': False, 'broadcast_buffers': False, 'zero_sharding': 'none', 'fp16': False, 'memory_efficient_fp16': False, 'bf16': False, 'no_reshard_after_forward': False, 'fp32_reduce_scatter': False, 'cpu_offload': False, 'use_sharded_state': True, 'gradient_predivide_factor': None, 'distributed_num_procs': 8}
2022-10-27 05:06:21 | INFO | metaseq.distributed.utils | initialized host i-050656823f6e88c4b as rank 0
2022-10-27 05:06:21 | INFO | metaseq.distributed.utils | initialized host i-050656823f6e88c4b as rank 1
2022-10-27 05:06:21 | INFO | metaseq.distributed.utils | initialized host i-050656823f6e88c4b as rank 6
2022-10-27 05:06:21 | INFO | metaseq.distributed.utils | initialized host i-050656823f6e88c4b as rank 5
2022-10-27 05:06:21 | INFO | metaseq.distributed.utils | initialized host i-050656823f6e88c4b as rank 3
2022-10-27 05:06:21 | INFO | metaseq.distributed.utils | initialized host i-050656823f6e88c4b as rank 2
2022-10-27 05:06:21 | INFO | metaseq.distributed.utils | initialized host i-050656823f6e88c4b as rank 4
2022-10-27 05:06:21 | INFO | metaseq.distributed.utils | initialized host i-050656823f6e88c4b as rank 7
In distributed utils - cfg.common.model_parallel_size = 8
> initializing tensor model parallel with size 8
> initializing pipeline model parallel with size 1
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2719 and data parallel seed: 1
2022-10-27 05:06:25 | INFO | metaseq.hub_utils | loading model(s) from /home/jason_chou/redspot_home/66b/reshard.pt
In load_model_ensemble_and_task filenames = ['/home/jason_chou/redspot_home/66b/reshard.pt'] arg_overrides = {} suffix = -model_part-0
Inside load_checkpoint_to_cpu path = /home/jason_chou/redspot_home/66b/reshard-model_part-0.pt arg_overrides = {}
Inside get_paths_to_load local_path = /home/jason_chou/redspot_home/66b/reshard-model_part-0.pt suffix = shard checkpoint_files = ['/home/jason_chou/redspot_home/66b/reshard-model_part-0.pt']
2022-10-27 05:10:48 | INFO | metaseq.checkpoint_utils | Done reading from disk
2022-10-27 05:10:52 | INFO | metaseq.checkpoint_utils | Done loading state dict
2022-10-27 05:10:52 | INFO | metaseq.cli.interactive | loaded model 0
2022-10-27 05:10:55 | INFO | metaseq.cli.interactive | Worker engaged! 172.21.41.241:6010
* Serving Flask app 'metaseq.cli.interactive_hosted' (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off
2022-10-27 05:10:55 | INFO | werkzeug | WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:6010
* Running on http://172.21.41.241:6010
2022-10-27 05:10:55 | INFO | werkzeug | Press CTRL+C to quit
2022-10-27 05:11:16 | INFO | metaseq.hub_utils | Preparing generator with settings {'_name': None, 'beam': 1, 'nbest': 1, 'max_len_a': 0, 'max_len_b': 70, 'min_len': 42, 'sampling': True, 'sampling_topp': 0.9, 'temperature': 1.0, 'no_seed_provided': False, 'buffer_size': 4194304, 'input': '-'}
2022-10-27 05:11:16 | INFO | metaseq.hub_utils | Executing generation on input tensor size torch.Size([1, 38])
2022-10-27 05:11:18 | INFO | metaseq.hub_utils | Total time: 1.235 seconds; generation time: 1.228
2022-10-27 05:11:18 | INFO | werkzeug | 127.0.0.1 - - [27/Oct/2022 05:11:18] "POST /completions HTTP/1.1" 200 -
I have checked the generated tokens and they look reasonable.
So, the problem is that we are providing the "--distributed-port 13000" flag in constants.py, which makes the code branch into the Slurm path.
If we remove that param, we would go into this code https://github.com/facebookresearch/metaseq/blob/main/metaseq/distributed/utils.py#L112 which does set the init-method appropriately (same as what we did here).
This is where the branching happens https://github.com/facebookresearch/metaseq/blob/main/metaseq/distributed/utils.py#L42
Could you try this once? Drop the hardcoded "tcp://localhost:13000" and simply remove "--distributed-port 13000" from constants.py.
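i.e. something like this (an abridged, hypothetical sketch of the relevant part of constants.py; the tokenizer and batching flags are unchanged):

MODEL_FILE = "/home/jason_chou/redspot_home/66b/reshard.pt"

# With the port flag dropped, call_main falls through to the single-node path
# that infers the init method itself.
LAUNCH_ARGS = [
    "--model-parallel-size 8",
    "--distributed-world-size 8",
    "--ddp-backend fully_sharded",
    "--use-sharded-state",
    # "--distributed-port 13000",  # removed: with this set, utils.py takes the Slurm branch
    "--task language_modeling",
    f"--path {MODEL_FILE}",
    "/tmp",  # required "data" argument
]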