Text generation fails on --devices 2
Hi, I am trying to generate text predictions using falcon-7b-instruct on a machine with two A10 24 GB GPUs. When I run generate with the default --devices option (which is 1), it runs successfully, but it fails with --devices 2.
python generate/base.py --prompt "Hello, my name is" --checkpoint_dir checkpoints/tiiuae/falcon-7b-instruct
Output with the default --devices (1):
Loading model 'checkpoints/tiiuae/falcon-7b-instruct/lit_model.pth' with {'block_size': 2048, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 65024, 'n_layer': 32, 'n_head': 71, 'n_embd': 4544, 'rotary_percentage': 1.0, 'parallel_residual': True, 'bias': False, 'n_query_groups': 1, 'shared_attention_norm': True}
Time to instantiate model: 0.15 seconds.
Time to load the model weights: 15.32 seconds.
Global seed set to 1234
Hello, my name is Jack.
Some people think that having a blog is a great way to make money online and others insist that it is not. In my own view, I do agree with the latter one.
But in the end, it will have to depend
Time for inference 1: 2.13 sec total, 23.47 tokens/sec
Memory used: 14.56 GB
python generate/base.py --prompt "Hello, my name is" --checkpoint_dir checkpoints/tiiuae/falcon-7b-instruct --devices 2
Output with --devices 2:
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
Loading model 'checkpoints/tiiuae/falcon-7b-instruct/lit_model.pth' with {'block_size': 2048, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 65024, 'n_layer': 32, 'n_head': 71, 'n_embd': 4544, 'rotary_percentage': 1.0, 'parallel_residual': True, 'bias': False, 'n_query_groups': 1, 'shared_attention_norm': True}
Time to instantiate model: 1.33 seconds.
Time to load the model weights: 16.37 seconds.
Traceback (most recent call last):
  File "/home/ubuntu/llm-repos/lit-parrot/generate/base.py", line 204, in <module>
    CLI(main)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/jsonargparse/cli.py", line 85, in CLI
    return _run_component(component, cfg_init)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/jsonargparse/cli.py", line 147, in _run_component
    return component(**cfg)
  File "/home/ubuntu/llm-repos/lit-parrot/generate/base.py", line 156, in main
    model = fabric.setup_module(model)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 265, in setup_module
    module = self._strategy.setup_module(module)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/lightning/fabric/strategies/ddp.py", line 121, in setup_module
    return DistributedDataParallel(module=module, device_ids=device_ids, **self._ddp_kwargs)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 805, in __init__
    self._ddp_init_helper(
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1095, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.44 GiB. GPU 1 has a total capacty of 22.05 GiB of which 7.74 GiB is free. Including non-PyTorch memory, this process has 14.31 GiB memory in use. Of the allocated memory 13.49 GiB is allocated by PyTorch, and 49.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Traceback (most recent call last):
  File "/home/ubuntu/llm-repos/lit-parrot/generate/base.py", line 204, in <module>
    CLI(main)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/jsonargparse/cli.py", line 85, in CLI
    return _run_component(component, cfg_init)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/jsonargparse/cli.py", line 147, in _run_component
    return component(**cfg)
  File "/home/ubuntu/llm-repos/lit-parrot/generate/base.py", line 156, in main
    model = fabric.setup_module(model)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 265, in setup_module
    module = self._strategy.setup_module(module)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/lightning/fabric/strategies/ddp.py", line 121, in setup_module
    return DistributedDataParallel(module=module, device_ids=device_ids, **self._ddp_kwargs)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 805, in __init__
    self._ddp_init_helper(
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1095, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.44 GiB. GPU 0 has a total capacty of 22.05 GiB of which 7.74 GiB is free. Including non-PyTorch memory, this process has 14.31 GiB memory in use. Of the allocated memory 13.49 GiB is allocated by PyTorch, and 49.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Thanks for reporting this! I can repro and see that devices=2 requires 39.80 GB.
I'll investigate :microscope:
Oh, I just noticed what the issue is. If you don't pass a --strategy, it will choose DDP, which keeps a full replica of the model on every GPU.
You should add --strategy fsdp when using more than one device. This is explained in https://github.com/Lightning-AI/lit-parrot/blob/main/howto/inference.md#run-a-large-model-on-multiple-smaller-devices. If you do that, it will only use 12 GB.
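For reference, the full command would then be the same as above with just the strategy flag added:

python generate/base.py --prompt "Hello, my name is" --checkpoint_dir checkpoints/tiiuae/falcon-7b-instruct --devices 2 --strategy fsdp

And as a minimal sketch of what that means in Fabric terms (this is not the repo's actual generate/base.py; the constructor arguments here are only illustrative):

from lightning.fabric import Fabric

# With the default "ddp" strategy, each of the 2 processes keeps a full replica
# of the ~14 GB model, and DDP's Reducer additionally allocates gradient buckets
# of roughly the same size (the 13.44 GiB allocation in the traceback), which
# does not fit on a 24 GB A10. With "fsdp", the parameters are sharded across
# the GPUs instead, so each process only holds part of the model.
fabric = Fabric(accelerator="cuda", devices=2, strategy="fsdp")
fabric.launch()
# model = ...                         # instantiate the model as generate/base.py does
# model = fabric.setup_module(model)  # wraps the model so its parameters are sharded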
Thanks @carmocca, with --strategy fsdp it is able to use both GPUs and runs successfully. However, I notice a significant degradation in tokens/sec for the multi-GPU run (2 devices in this case):
Time for inference 1: 31.87 sec total, 1.57 tokens/sec
Memory used: 12.45 GB
which is much slower than the single-device run:
Time for inference 1: 2.15 sec total, 23.30 tokens/sec
Memory used: 14.56 GB
Yes, the speed degradation is expected: you are trading throughput for lower memory requirements. FSDP shards the parameters across the GPUs, so each forward pass has to gather them over the interconnect before computing, which adds communication overhead that a single-device run does not have.