metaseq
DGX-2 with 16 V100s
What is your question?
We have a DGX-2 with 16 V100 32GB GPUs and would like to run the 175B model from OPT. How can I reshard a consolidated .pt into 16 shards to overcome the per-GPU memory limit?
What have you tried?
Loading the model on 8 GPUs, but this results in OOM errors.
Thanks, I have two people trying this on smaller hardware now to see if it works. I'll update with their findings (if you don't hear back further in a few days, @ me in this issue)
Internally, we have (roughly) this code for launching the API on slurm:
import subprocess


def _spawn_workers(num_nodes=1):
    model_parallel = 8
    job_name = "api"
    # Build the sbatch command that wraps an srun launch of the API workers.
    cmd = f"""
sbatch --ntasks-per-node {model_parallel} --gpus-per-node 8 --nodes {num_nodes} \\
    -n {model_parallel} -c 10 \\
    --mem 400gb --job-name {job_name} --open-mode append \\
    --wrap "srun --quit-on-interrupt -K1 python3 -m metaseq_cli.interactive_hosted"
""".strip()
    subprocess.Popen(cmd, shell=True)
I think (but haven't explicitly tested in some time) that just changing num_nodes=2 will work.
Oh, `-n {model_parallel}` may need to become `-n {num_nodes * model_parallel}`.
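For concreteness, here is a minimal sketch of the launcher with both of those changes applied (a hypothetical variant of the snippet above, untested on any particular cluster; the remaining sbatch flags are unchanged):

```python
import subprocess


def _spawn_workers(num_nodes=2, model_parallel=8, job_name="api"):
    # Two nodes x 8 GPUs gives the 16 model-parallel workers needed for MP16,
    # so -n must count tasks across all nodes, not just one.
    total_tasks = num_nodes * model_parallel
    cmd = f"""
sbatch --ntasks-per-node {model_parallel} --gpus-per-node 8 --nodes {num_nodes} \\
    -n {total_tasks} -c 10 \\
    --mem 400gb --job-name {job_name} --open-mode append \\
    --wrap "srun --quit-on-interrupt -K1 python3 -m metaseq_cli.interactive_hosted"
""".strip()
    subprocess.Popen(cmd, shell=True)
```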
I tried with world size 16, but it just gets stuck with 8 processes currently :( I will wait a bit until you have some findings :) Thank you!
Did you try the change to the `-n` arg? Can you paste a log for us?
I am sorry, we are currently having problems with our slurmd; I will provide logs as soon as we have fixed that. Yes, I changed the -n to 16.
Just tried:
size mismatch for _fpw_module.decoder.layers.95._fsdp_wrapped_module.flat_param_0: copying a param with shape torch.Size([226576896]) from checkpoint, the shape in current model is torch.Size([113288448]).
YES, this is my issue!
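As an aside, that factor-of-two mismatch (226576896 vs. 113288448) is what you would expect when an MP8 shard is loaded into a model built for MP16: each 16-way flattened parameter should be half the size of its 8-way counterpart. If you want to confirm which sharding a checkpoint file actually has, here is a minimal sketch (assuming the shard is a plain torch pickle with a `model` state dict; the filename and key names are placeholders, adjust them to your layout):

```python
import torch

# Hypothetical shard filename; point this at whichever reshard/checkpoint part you want to inspect.
shard = torch.load("reshard-model_part-0.pt", map_location="cpu")

# Print the flattened-parameter shapes so they can be compared against the error message.
for name, tensor in shard["model"].items():
    if "flat_param" in name:
        print(name, tuple(tensor.shape))
```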
Watching this issue with interest, for hardware planning.
With 16 x 32GB of GPU RAM = 512GB, should that be enough to load the model for inference? In principle it seems like you would need 175B x 4 bytes = 700GB.
A natural configuration is 8x 80GB A100s, but that also seems slightly short. Will it be possible to do inference with 175B on 8 x 80GB = 640GB of GPU RAM?
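For rough planning, a back-of-the-envelope sketch (assuming 2 bytes per parameter for fp16 weights; the 175B x 4 = 700GB figure above corresponds to fp32, and none of this counts activations or the KV cache):

```python
params = 175e9  # OPT-175B parameter count

fp16_total = params * 2  # bytes of weights in fp16
fp32_total = params * 4  # bytes of weights in fp32

print(f"fp16 weights: {fp16_total / 1e9:.0f} GB")        # ~350 GB
print(f"fp32 weights: {fp32_total / 1e9:.0f} GB")        # ~700 GB
print(f"per GPU, MP16: {fp16_total / 16 / 1e9:.1f} GB")  # ~21.9 GB
print(f"per GPU, MP8:  {fp16_total / 8 / 1e9:.1f} GB")   # ~43.8 GB
```

On that arithmetic, 16 x 32GB fits the fp16 weights with headroom and 8 x 80GB fits comfortably, while 8 x 32GB does not, which matches the OOMs from the original question.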
@klshuster has made some progress on this and is fighting one last demon.
Update: @klshuster managed to get things to run with FSDP+MP, but it's very slow.
@ngoyal2707 is working on resharding to >8 MP workers, which should significantly improve throughput. Not quite as fast as one giant machine, but it should be closer.
Update: we got MP16 working on AWS earlier today. Latency was 120ms/token compared to MP8's 78ms/token.
All credit to @klshuster and @ngoyal2707 for their hard work in delivering this. We'll post instructions soon.
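For a rough sense of scale, the reciprocal of those per-token latencies (single stream, ignoring batching):

```python
mp16_ms_per_token = 120  # MP16 across two nodes (from the update above)
mp8_ms_per_token = 78    # MP8 on one node

print(f"MP16: {1000 / mp16_ms_per_token:.1f} tokens/s")  # ~8.3 tokens/s
print(f"MP8:  {1000 / mp8_ms_per_token:.1f} tokens/s")   # ~12.8 tokens/s
```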
Any update on instructions for this?
+1 for any update on instructions for this?
Assuming we have the following:
CHECKPOINT=/path/to/fsdp_sharded_checkpoint/checkpoint_last
CONSOLIDATED=/path/to/new_consolidated_checkpoint/
RESHARDED=/path/to/new_resharded_checkpoint/
MP=16
Step 0
(Optional, if necessary) Consolidate the model from the FSDP shards into one checkpoint:
python consolidate_fsdp_shards.py $CHECKPOINT $CONSOLIDATED/consolidated
Step 1
Use the model parallel reshard script to reshard the model into 16 parts.
python reshard_model_parallel.py $CONSOLIDATED/consolidated $MP --save-prefix $RESHARDED/reshard
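If it helps, Steps 0 and 1 can also be driven from a small Python script; this is just a sketch that shells out to the same two commands (the paths are the placeholders from above, and the script locations depend on where they live in your metaseq checkout):

```python
import subprocess

CHECKPOINT = "/path/to/fsdp_sharded_checkpoint/checkpoint_last"
CONSOLIDATED = "/path/to/new_consolidated_checkpoint"
RESHARDED = "/path/to/new_resharded_checkpoint"
MP = 16

# Step 0 (optional): consolidate the FSDP shards into one checkpoint.
subprocess.run(
    ["python", "consolidate_fsdp_shards.py", CHECKPOINT, f"{CONSOLIDATED}/consolidated"],
    check=True,
)

# Step 1: reshard the consolidated checkpoint into 16 model-parallel parts.
subprocess.run(
    ["python", "reshard_model_parallel.py", f"{CONSOLIDATED}/consolidated", str(MP),
     "--save-prefix", f"{RESHARDED}/reshard"],
    check=True,
)
```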
Step 2
Update the constants file to point to the right paths
MODEL_PARALLEL = 16
TOTAL_WORLD_SIZE = 16
.
.
.
CHECKPOINT_FOLDER=$RESHARDED # note, make sure you leave out the `reshard.pt`; that is added automatically
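For reference, the edited section of the constants file would look roughly like this in Python (a sketch containing only the values given above; the path is a placeholder and everything else in the file stays as shipped):

```python
MODEL_PARALLEL = 16
TOTAL_WORLD_SIZE = 16

# Point at the folder containing the resharded files; reshard.pt itself
# is appended automatically, so do not include it here.
CHECKPOINT_FOLDER = "/path/to/new_resharded_checkpoint"
```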
Step 3
SLURM command (adapted from api docs)
MODEL_PARALLEL=8 # while we resharded to 16, it's still technically 8 per node
NODES=2
srun --ntasks-per-node 1 --gpus-per-node $MODEL_PARALLEL --nodes $NODES --cpus-per-task 8 --mem 400gb \
--quit-on-interrupt --job-name genwork \
python3 -m metaseq_cli.interactive_hosted
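Before launching, it is worth confirming that Step 1 actually produced 16 shard files under the resharded folder; a minimal check (the glob assumes a `reshard*` naming implied by the `--save-prefix` above, so adjust it if your filenames differ):

```python
import glob

RESHARDED = "/path/to/new_resharded_checkpoint"

# One file per model-parallel part is expected after resharding to MP16.
parts = sorted(glob.glob(f"{RESHARDED}/reshard*.pt"))
print(f"found {len(parts)} shard file(s)")
assert len(parts) == 16, "expected 16 model-parallel parts for MP16"
```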
@klshuster One clarification question on Step 0: with consolidate_fsdp_shards.py, the default will produce a model with mp=1. Do I need to pass the argument '--new-arch-name transformer_lm_gpt', or is it fine to still initialize the model as a Megatron model?
still fine to init as megatron model
Hi, thank you for sharing this amazing repo and solution! I have been following the steps to run inference with OPT-175B on two nodes of 8x V100. However, it seems that I hit an out-of-memory problem, as shown in the log below. I would like to know how much memory we actually need to host OPT-175B.
Command to run
MODEL_PARALLEL=8 # while we resharded to 16, it's still technically 8 per node
NODES=2
srun --ntasks-per-node 1 --gpus-per-node $MODEL_PARALLEL --nodes $NODES --cpus-per-task 8 --mem 400gb \
--quit-on-interrupt --job-name genwork \
python3 -m metaseq_cli.interactive_hosted
srun: job 26365 queued and waiting for resources
srun: job 26365 has been allocated resources
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl15 as rank 0
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl15 as rank 3
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl17 as rank 10
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl17 as rank 13
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl17 as rank 11
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl15 as rank 7
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl15 as rank 4
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl15 as rank 2
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl17 as rank 12
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl17 as rank 9
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl17 as rank 8
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl17 as rank 15
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl15 as rank 1
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl17 as rank 14
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl15 as rank 6
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl15 as rank 5
2022-10-18 12:16:11 | INFO | metaseq.distributed.utils | SLURM nodelist: npl[15,17]
initializing tensor model parallel with size 16
initializing pipeline model parallel with size 1
initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2719 and data parallel seed: 1
2022-10-18 12:16:14 | INFO | metaseq.hub_utils | loading model(s) from /gpfs/u/home/AICD/AICDzhnf/scratch/new_shard_meta/new_shard_16/reshard.pt
2022-10-18 12:16:57 | INFO | metaseq.checkpoint_utils | Done reading from disk
Traceback (most recent call last):
  File "/gpfs/u/home/AICD/AICDzhnf/scratch/x64/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfs/u/home/AICD/AICDzhnf/scratch/x64/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq_cli/interactive_hosted.py", line 384, in <module>
    cli_main()
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq_cli/interactive_hosted.py", line 370, in cli_main
    distributed_utils.call_main(cfg, worker_main, namespace_args=args)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/distributed/utils.py", line 272, in call_main
    return _spawn_helper(main, cfg, kwargs)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/distributed/utils.py", line 250, in _spawn_helper
    retval = distributed_main(-1, main, cfg, kwargs)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/distributed/utils.py", line 212, in distributed_main
    retval = main(cfg, **kwargs)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq_cli/interactive_hosted.py", line 176, in worker_main
    models = generator.load_model()  # noqa: F841
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/hub_utils.py", line 579, in load_model
    models, _model_args, _task = _load_checkpoint()
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/hub_utils.py", line 562, in _load_checkpoint
    return checkpoint_utils.load_model_ensemble_and_task(
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/checkpoint_utils.py", line 488, in load_model_ensemble_and_task
    model = build_model_hook(cfg, task)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/hub_utils.py", line 553, in _build_model
    model = task.build_model(cfg.model).cuda()
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/tasks/base_task.py", line 531, in build_model
    model = models.build_model(args, self)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/models/__init__.py", line 87, in build_model
    return model.build_model(cfg, task)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/models/transformer_lm.py", line 185, in build_model
    decoder = TransformerDecoder(
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/models/transformer_decoder.py", line 127, in __init__
    layers.append(self.build_decoder_layer(args))
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/models/transformer_decoder.py", line 253, in build_decoder_layer
    layer = self.build_base_decoder_layer(args)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/models/transformer_decoder.py", line 250, in build_base_decoder_layer
    return TransformerDecoderLayer(args)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/modules/transformer_decoder_layer.py", line 80, in __init__
    self.fc1 = self.build_fc1(
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/modules/transformer_decoder_layer.py", line 118, in build_fc1
    return Linear(
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/modules/linear.py", line 41, in __init__
    torch.empty(out_features, in_features, device=device, dtype=dtype)
RuntimeError: CUDA out of memory. Tried to allocate 1.12 GiB (GPU 0; 31.75 GiB total capacity; 29.33 GiB already allocated; 645.75 MiB free; 29.33 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/gpfs/u/home/AICD/AICDzhnf/scratch/x64/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfs/u/home/AICD/AICDzhnf/scratch/x64/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq_cli/interactive_hosted.py", line 384, in <module>
    cli_main()
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq_cli/interactive_hosted.py", line 370, in cli_main
    distributed_utils.call_main(cfg, worker_main, namespace_args=args)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/distributed/utils.py", line 272, in call_main
    return _spawn_helper(main, cfg, kwargs)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/distributed/utils.py", line 250, in _spawn_helper
    retval = distributed_main(-1, main, cfg, kwargs)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/distributed/utils.py", line 212, in distributed_main
    retval = main(cfg, **kwargs)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq_cli/interactive_hosted.py", line 176, in worker_main
    models = generator.load_model()  # noqa: F841
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/hub_utils.py", line 579, in load_model
    models, _model_args, _task = _load_checkpoint()
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/hub_utils.py", line 562, in _load_checkpoint
    return checkpoint_utils.load_model_ensemble_and_task(
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/checkpoint_utils.py", line 488, in load_model_ensemble_and_task
    model = build_model_hook(cfg, task)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/hub_utils.py", line 553, in _build_model
    model = task.build_model(cfg.model).cuda()
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/tasks/base_task.py", line 531, in build_model
    model = models.build_model(args, self)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/models/__init__.py", line 87, in build_model
    return model.build_model(cfg, task)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/models/transformer_lm.py", line 185, in build_model
    decoder = TransformerDecoder(
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/models/transformer_decoder.py", line 127, in __init__
    layers.append(self.build_decoder_layer(args))
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/models/transformer_decoder.py", line 253, in build_decoder_layer
    layer = self.build_base_decoder_layer(args)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/models/transformer_decoder.py", line 250, in build_base_decoder_layer
    return TransformerDecoderLayer(args)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/modules/transformer_decoder_layer.py", line 80, in __init__
    self.fc1 = self.build_fc1(
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/modules/transformer_decoder_layer.py", line 118, in build_fc1
    return Linear(
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/modules/linear.py", line 41, in __init__
    torch.empty(out_features, in_features, device=device, dtype=dtype)
RuntimeError: CUDA out of memory. Tried to allocate 1.12 GiB (GPU 0; 31.75 GiB total capacity; 29.33 GiB already allocated; 645.75 MiB free; 29.33 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
srun: error: npl15: task 0: Exited with exit code 1
srun: error: npl17: task 1: Exited with exit code 1
Have you properly installed apex for fp16 support? That's the first thing that comes to mind as to why you might be experiencing OOMs; 16 x 32GB GPUs is plenty of memory to at least load the model into memory (it's only 384GB and you're working with 512GB).
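One quick way to check whether the apex CUDA extensions are importable in the environment the job actually runs in (a sketch only; it verifies that the build is visible to Python, not that metaseq's fp16 path is using it, and the `fused_layer_norm_cuda` module name assumes apex was built with `--cuda_ext` as in the setup docs):

```python
try:
    import apex  # noqa: F401
    # Compiled extension that is only present when apex is built with --cuda_ext.
    import fused_layer_norm_cuda  # noqa: F401
    print("apex with CUDA extensions looks importable")
except ImportError as exc:
    print(f"apex check failed: {exc}")
```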
Dear klshuster,
Thanks for your reply.
I have installed the Apex library carefully, following https://github.com/facebookresearch/metaseq/blob/main/docs/setup.md .
Is there any other possible cause of the problem?
Regards, Zhenfang
It seems that it is trying to load the parameters of all the layers onto a single GPU, which then runs out of memory.
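That reading is consistent with the numbers in the traceback: the failed 1.12 GiB allocation is roughly the size of a full, unsharded 12288 x 49152 fc1 weight in fp16 (12288 x 49152 x 2 bytes ≈ 1.12 GiB), so the layer does not appear to be split 16 ways. One way to watch this per rank is to log GPU memory around model construction; a minimal sketch using standard PyTorch counters (where exactly to call it in your local metaseq copy is up to you):

```python
import os

import torch


def log_gpu_memory(tag: str) -> None:
    # RANK is the usual torch.distributed environment variable; adjust if your launcher differs.
    rank = os.environ.get("RANK", "?")
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"[rank {rank}] {tag}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")
```

If the MP16 split were in effect, each rank's weights should stay near the ~22 GB per-GPU share from the arithmetic earlier in the thread.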
Hey, have you managed to solve this problem?