metaseq
DGX-2 with 16 V100s
What is your question?
We have a DGX-2 with 16 V100 32GB GPUs and would like to run the 175B model from OPT. How can I reshard a consolidated .pt into 16 shards to overcome the per-GPU memory limit?
What have you tried?
Loading the model on 8 GPUs, but this results in OOM errors.
Thanks, I have two people trying this on smaller hardware now to see if it works. I'll update with their findings (if you don't hear back further in a few days, @ me in this issue)
Internally, we have (roughly) this code for launching the API on slurm:
import subprocess


def _spawn_workers(num_nodes=1):
    model_parallel = 8
    job_name = "api"
    # Build the sbatch command that wraps an srun launch of the API workers.
    cmd = f"""
sbatch --ntasks-per-node {model_parallel} --gpus-per-node 8 --nodes {num_nodes} \\
    -n {model_parallel} -c 10 \\
    --mem 400gb --job-name {job_name} --open-mode append \\
    --wrap "srun --quit-on-interrupt -K1 python3 -m metaseq_cli.interactive_hosted"
""".strip()
    subprocess.Popen(cmd, shell=True)
I think (but haven't explicitly tested in some time) that just changing num_nodes=2 will work.
Oh, `-n {model_parallel}` may need to become `-n {num_nodes * model_parallel}`.
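For concreteness, here is a minimal sketch of the launcher with both of those changes applied (a hypothetical variant of the snippet above, untested on any particular cluster; the remaining sbatch flags are unchanged):

```python
import subprocess


def _spawn_workers(num_nodes=2, model_parallel=8, job_name="api"):
    # Two nodes x 8 GPUs gives the 16 model-parallel workers needed for MP16,
    # so -n must count tasks across all nodes, not just one.
    total_tasks = num_nodes * model_parallel
    cmd = f"""
sbatch --ntasks-per-node {model_parallel} --gpus-per-node 8 --nodes {num_nodes} \\
    -n {total_tasks} -c 10 \\
    --mem 400gb --job-name {job_name} --open-mode append \\
    --wrap "srun --quit-on-interrupt -K1 python3 -m metaseq_cli.interactive_hosted"
""".strip()
    subprocess.Popen(cmd, shell=True)
```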
I tried with world size 16, but it just gets stuck with 8 processes currently :( I will wait a bit until you have some findings :) Thank you!
Did you try the change to the `-n` arg? Can you paste a log for us?
I am sorry, we are currently having problems with our slurmd; I will provide logs as soon as we have fixed that. Yes, I changed the -n to 16.
Just tried:
size mismatch for _fpw_module.decoder.layers.95._fsdp_wrapped_module.flat_param_0: copying a param with shape torch.Size([226576896]) from checkpoint, the shape in current model is torch.Size([113288448]).
YES, this is my issue!
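As an aside, that factor-of-two mismatch (226576896 vs. 113288448) is what you would expect when an MP8 shard is loaded into a model built for MP16: each 16-way flattened parameter should be half the size of its 8-way counterpart. If you want to confirm which sharding a checkpoint file actually has, here is a minimal sketch (assuming the shard is a plain torch pickle with a `model` state dict; the filename and key names are placeholders, adjust them to your layout):

```python
import torch

# Hypothetical shard filename; point this at whichever reshard/checkpoint part you want to inspect.
shard = torch.load("reshard-model_part-0.pt", map_location="cpu")

# Print the flattened-parameter shapes so they can be compared against the error message.
for name, tensor in shard["model"].items():
    if "flat_param" in name:
        print(name, tuple(tensor.shape))
```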
Watching this issue with interest, for hardware planning.
With 16 x 32GB of GPU RAM = 512GB, should that be enough to load the model for inference? In principle it seems like you would need 175B x 4 bytes = 700GB.
A natural configuration is 8x 80GB A100s, but that also seems slightly short. Will it be possible to do inference with 175B on 8 x 80GB = 640GB of GPU RAM?
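For rough planning, a back-of-the-envelope sketch (assuming 2 bytes per parameter for fp16 weights; the 175B x 4 = 700GB figure above corresponds to fp32, and none of this counts activations or the KV cache):

```python
params = 175e9  # OPT-175B parameter count

fp16_total = params * 2  # bytes of weights in fp16
fp32_total = params * 4  # bytes of weights in fp32

print(f"fp16 weights: {fp16_total / 1e9:.0f} GB")        # ~350 GB
print(f"fp32 weights: {fp32_total / 1e9:.0f} GB")        # ~700 GB
print(f"per GPU, MP16: {fp16_total / 16 / 1e9:.1f} GB")  # ~21.9 GB
print(f"per GPU, MP8:  {fp16_total / 8 / 1e9:.1f} GB")   # ~43.8 GB
```

On that arithmetic, 16 x 32GB fits the fp16 weights with headroom and 8 x 80GB fits comfortably, while 8 x 32GB does not, which matches the OOMs from the original question.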
@klshuster has made some progress on this and is fighting one last demon.
Update: @klshuster managed to get things to run with FSDP+MP, but it's very slow.
@ngoyal2707 is working on resharding to >8 MP workers, which should significantly improve throughput. Not quite as fast as one giant machine, but it should be closer.
Update: we got MP16 working on AWS earlier today. Latency was 120ms/token compared to MP8's 78ms/token.
All credit to @klshuster and @ngoyal2707 for their hard work in delivering this. We'll post instructions soon.
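For a rough sense of scale, the reciprocal of those per-token latencies (single stream, ignoring batching):

```python
mp16_ms_per_token = 120  # MP16 across two nodes (from the update above)
mp8_ms_per_token = 78    # MP8 on one node

print(f"MP16: {1000 / mp16_ms_per_token:.1f} tokens/s")  # ~8.3 tokens/s
print(f"MP8:  {1000 / mp8_ms_per_token:.1f} tokens/s")   # ~12.8 tokens/s
```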
Any update on instructions for this?
+1 for any update on instructions for this?
Assuming we have the following:
CHECKPOINT=/path/to/fsdp_sharded_checkpoint/checkpoint_last
CONSOLIDATED=/path/to/new_consolidated_checkpoint/
RESHARDED=/path/to/new_resharded_checkpoint/
MP=16
Step 0
(Optional, if necessary) Consolidate the model from the FSDP shards into one checkpoint:
python consolidate_fsdp_shards.py $CHECKPOINT $CONSOLIDATED/consolidated
Step 1
Use the model parallel reshard script to reshard the model into 16 parts.
python reshard_model_parallel.py $CONSOLIDATED/consolidated $MP --save-prefix $RESHARDED/reshard
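If it helps, Steps 0 and 1 can also be driven from a small Python script; this is just a sketch that shells out to the same two commands (the paths are the placeholders from above, and the script locations depend on where they live in your metaseq checkout):

```python
import subprocess

CHECKPOINT = "/path/to/fsdp_sharded_checkpoint/checkpoint_last"
CONSOLIDATED = "/path/to/new_consolidated_checkpoint"
RESHARDED = "/path/to/new_resharded_checkpoint"
MP = 16

# Step 0 (optional): consolidate the FSDP shards into one checkpoint.
subprocess.run(
    ["python", "consolidate_fsdp_shards.py", CHECKPOINT, f"{CONSOLIDATED}/consolidated"],
    check=True,
)

# Step 1: reshard the consolidated checkpoint into 16 model-parallel parts.
subprocess.run(
    ["python", "reshard_model_parallel.py", f"{CONSOLIDATED}/consolidated", str(MP),
     "--save-prefix", f"{RESHARDED}/reshard"],
    check=True,
)
```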
Step 2
Update the constants file to point to the right paths
MODEL_PARALLEL = 16
TOTAL_WORLD_SIZE = 16
.
.
.
CHECKPOINT_FOLDER=$RESHARDED # note, make sure you leave out the `reshard.pt`; that is added automatically
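For reference, the edited section of the constants file would look roughly like this in Python (a sketch containing only the values given above; the path is a placeholder and everything else in the file stays as shipped):

```python
MODEL_PARALLEL = 16
TOTAL_WORLD_SIZE = 16

# Point at the folder containing the resharded files; reshard.pt itself
# is appended automatically, so do not include it here.
CHECKPOINT_FOLDER = "/path/to/new_resharded_checkpoint"
```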
Step 3
SLURM command (adapted from api docs)
MODEL_PARALLEL=8 # while we resharded to 16, it's still technically 8 per node
NODES=2
srun --ntasks-per-node 1 --gpus-per-node $MODEL_PARALLEL --nodes $NODES --cpus-per-task 8 --mem 400gb \
--quit-on-interrupt --job-name genwork \
python3 -m metaseq_cli.interactive_hosted
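Before launching, it is worth confirming that Step 1 actually produced 16 shard files under the resharded folder; a minimal check (the glob assumes a `reshard*` naming implied by the `--save-prefix` above, so adjust it if your filenames differ):

```python
import glob

RESHARDED = "/path/to/new_resharded_checkpoint"

# One file per model-parallel part is expected after resharding to MP16.
parts = sorted(glob.glob(f"{RESHARDED}/reshard*.pt"))
print(f"found {len(parts)} shard file(s)")
assert len(parts) == 16, "expected 16 model-parallel parts for MP16"
```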
@klshuster One clarification question on Step 0: with consolidate_fsdp_shards.py, the default will produce a model with mp=1. Do I need to pass the argument '--new-arch-name transformer_lm_gpt', or is it fine to still initialize the model as a Megatron model?
still fine to init as megatron model
Hi, thank you for sharing this amazing repo and solution! I have been following the steps to run inference with OPT-175B on two nodes of 8x V100. However, it seems that I hit an out-of-memory problem, as shown in the log below. I would like to know how much memory we actually need to host OPT-175B.
Command to run
MODEL_PARALLEL=8 # while we resharded to 16, it's still technically 8 per node
NODES=2
srun --ntasks-per-node 1 --gpus-per-node $MODEL_PARALLEL --nodes $NODES --cpus-per-task 8 --mem 400gb \
--quit-on-interrupt --job-name genwork \
python3 -m metaseq_cli.interactive_hosted
srun: job 26365 queued and waiting for resources
srun: job 26365 has been allocated resources
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl15 as rank 0
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl15 as rank 3
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl17 as rank 10
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl17 as rank 13
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl17 as rank 11
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl15 as rank 7
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl15 as rank 4
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl15 as rank 2
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl17 as rank 12
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl17 as rank 9
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl17 as rank 8
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl17 as rank 15
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl15 as rank 1
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl17 as rank 14
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl15 as rank 6
2022-10-18 12:16:01 | INFO | metaseq.distributed.utils | initialized host npl15 as rank 5
2022-10-18 12:16:11 | INFO | metaseq.distributed.utils | SLURM nodelist: npl[15,17]
initializing tensor model parallel with size 16
initializing pipeline model parallel with size 1
initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 2719 and data parallel seed: 1
2022-10-18 12:16:14 | INFO | metaseq.hub_utils | loading model(s) from /gpfs/u/home/AICD/AICDzhnf/scratch/new_shard_meta/new_shard_16/reshard.pt
2022-10-18 12:16:57 | INFO | metaseq.checkpoint_utils | Done reading from disk
Traceback (most recent call last):
  File "/gpfs/u/home/AICD/AICDzhnf/scratch/x64/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfs/u/home/AICD/AICDzhnf/scratch/x64/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq_cli/interactive_hosted.py", line 384, in <module>
    cli_main()
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq_cli/interactive_hosted.py", line 370, in cli_main
    distributed_utils.call_main(cfg, worker_main, namespace_args=args)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/distributed/utils.py", line 272, in call_main
    return _spawn_helper(main, cfg, kwargs)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/distributed/utils.py", line 250, in _spawn_helper
    retval = distributed_main(-1, main, cfg, kwargs)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/distributed/utils.py", line 212, in distributed_main
    retval = main(cfg, **kwargs)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq_cli/interactive_hosted.py", line 176, in worker_main
    models = generator.load_model()  # noqa: F841
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/hub_utils.py", line 579, in load_model
    models, _model_args, _task = _load_checkpoint()
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/hub_utils.py", line 562, in _load_checkpoint
    return checkpoint_utils.load_model_ensemble_and_task(
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/checkpoint_utils.py", line 488, in load_model_ensemble_and_task
    model = build_model_hook(cfg, task)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/hub_utils.py", line 553, in _build_model
    model = task.build_model(cfg.model).cuda()
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/tasks/base_task.py", line 531, in build_model
    model = models.build_model(args, self)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/models/__init__.py", line 87, in build_model
    return model.build_model(cfg, task)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/models/transformer_lm.py", line 185, in build_model
    decoder = TransformerDecoder(
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/models/transformer_decoder.py", line 127, in __init__
    layers.append(self.build_decoder_layer(args))
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/models/transformer_decoder.py", line 253, in build_decoder_layer
    layer = self.build_base_decoder_layer(args)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/models/transformer_decoder.py", line 250, in build_base_decoder_layer
    return TransformerDecoderLayer(args)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/modules/transformer_decoder_layer.py", line 80, in __init__
    self.fc1 = self.build_fc1(
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/modules/transformer_decoder_layer.py", line 118, in build_fc1
    return Linear(
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/modules/linear.py", line 41, in __init__
    torch.empty(out_features, in_features, device=device, dtype=dtype)
RuntimeError: CUDA out of memory. Tried to allocate 1.12 GiB (GPU 0; 31.75 GiB total capacity; 29.33 GiB already allocated; 645.75 MiB free; 29.33 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/gpfs/u/home/AICD/AICDzhnf/scratch/x64/anaconda3/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/gpfs/u/home/AICD/AICDzhnf/scratch/x64/anaconda3/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq_cli/interactive_hosted.py", line 384, in <module>
    cli_main()
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq_cli/interactive_hosted.py", line 370, in cli_main
    distributed_utils.call_main(cfg, worker_main, namespace_args=args)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/distributed/utils.py", line 272, in call_main
    return _spawn_helper(main, cfg, kwargs)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/distributed/utils.py", line 250, in _spawn_helper
    retval = distributed_main(-1, main, cfg, kwargs)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/distributed/utils.py", line 212, in distributed_main
    retval = main(cfg, **kwargs)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq_cli/interactive_hosted.py", line 176, in worker_main
    models = generator.load_model()  # noqa: F841
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/hub_utils.py", line 579, in load_model
    models, _model_args, _task = _load_checkpoint()
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/hub_utils.py", line 562, in _load_checkpoint
    return checkpoint_utils.load_model_ensemble_and_task(
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/checkpoint_utils.py", line 488, in load_model_ensemble_and_task
    model = build_model_hook(cfg, task)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/hub_utils.py", line 553, in _build_model
    model = task.build_model(cfg.model).cuda()
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/tasks/base_task.py", line 531, in build_model
    model = models.build_model(args, self)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/models/__init__.py", line 87, in build_model
    return model.build_model(cfg, task)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/models/transformer_lm.py", line 185, in build_model
    decoder = TransformerDecoder(
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/models/transformer_decoder.py", line 127, in __init__
    layers.append(self.build_decoder_layer(args))
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/models/transformer_decoder.py", line 253, in build_decoder_layer
    layer = self.build_base_decoder_layer(args)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/models/transformer_decoder.py", line 250, in build_base_decoder_layer
    return TransformerDecoderLayer(args)
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/modules/transformer_decoder_layer.py", line 80, in __init__
    self.fc1 = self.build_fc1(
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/modules/transformer_decoder_layer.py", line 118, in build_fc1
    return Linear(
  File "/gpfs/u/scratch/AICD/AICDzhnf/opt/metaseq/metaseq/modules/linear.py", line 41, in __init__
    torch.empty(out_features, in_features, device=device, dtype=dtype)
RuntimeError: CUDA out of memory. Tried to allocate 1.12 GiB (GPU 0; 31.75 GiB total capacity; 29.33 GiB already allocated; 645.75 MiB free; 29.33 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
srun: error: npl15: task 0: Exited with exit code 1
srun: error: npl17: task 1: Exited with exit code 1
Have you properly installed apex for fp16 support? That's the first thing that comes to mind as to why you might be experiencing OOMs; 16 x 32GB GPUs is plenty of memory to at least load the model into memory (it's only 384GB and you're working with 512GB).
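One quick way to check whether the apex CUDA extensions are importable in the environment the job actually runs in (a sketch only; it verifies that the build is visible to Python, not that metaseq's fp16 path is using it, and the `fused_layer_norm_cuda` module name assumes apex was built with `--cuda_ext` as in the setup docs):

```python
try:
    import apex  # noqa: F401
    # Compiled extension that is only present when apex is built with --cuda_ext.
    import fused_layer_norm_cuda  # noqa: F401
    print("apex with CUDA extensions looks importable")
except ImportError as exc:
    print(f"apex check failed: {exc}")
```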
Dear klshuster,
Thanks for your reply.
I have installed the Apex library carefully, following https://github.com/facebookresearch/metaseq/blob/main/docs/setup.md .
Is there any other possible cause of the problem?
Regards, Zhenfang
It seems that it is trying to load the parameters of all the layers onto a single GPU, which then runs out of memory.
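That reading is consistent with the numbers in the traceback: the failed 1.12 GiB allocation is roughly the size of a full, unsharded 12288 x 49152 fc1 weight in fp16 (12288 x 49152 x 2 bytes ≈ 1.12 GiB), so the layer does not appear to be split 16 ways. One way to watch this per rank is to log GPU memory around model construction; a minimal sketch using standard PyTorch counters (where exactly to call it in your local metaseq copy is up to you):

```python
import os

import torch


def log_gpu_memory(tag: str) -> None:
    # RANK is the usual torch.distributed environment variable; adjust if your launcher differs.
    rank = os.environ.get("RANK", "?")
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"[rank {rank}] {tag}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")
```

If the MP16 split were in effect, each rank's weights should stay near the ~22 GB per-GPU share from the arithmetic earlier in the thread.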
Hey, have you managed to solve this problem?