Text generation fails on --devices 2
Hi, I am trying to generate text predictions using falcon-7b-instruct on a machine with two A10 24 GB GPUs. When I run generate with the default --devices option (which is 1), it runs successfully, but it fails with --devices 2.
python generate/base.py --prompt "Hello, my name is" --checkpoint_dir checkpoints/tiiuae/falcon-7b-instruct
Output with the default --devices (1):
Loading model 'checkpoints/tiiuae/falcon-7b-instruct/lit_model.pth' with {'block_size': 2048, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 65024, 'n_layer': 32, 'n_head': 71, 'n_embd': 4544, 'rotary_percentage': 1.0, 'parallel_residual': True, 'bias': False, 'n_query_groups': 1, 'shared_attention_norm': True}
Time to instantiate model: 0.15 seconds.
Time to load the model weights: 15.32 seconds.
Global seed set to 1234
Hello, my name is Jack.
Some people think that having a blog is a great way to make money online and others insist that it is not. In my own view, I do agree with the latter one.
But in the end, it will have to depend
Time for inference 1: 2.13 sec total, 23.47 tokens/sec
Memory used: 14.56 GB
python generate/base.py --prompt "Hello, my name is" --checkpoint_dir checkpoints/tiiuae/falcon-7b-instruct --devices 2
Output with --devices 2:
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
Loading model 'checkpoints/tiiuae/falcon-7b-instruct/lit_model.pth' with {'block_size': 2048, 'vocab_size': 50254, 'padding_multiple': 512, 'padded_vocab_size': 65024, 'n_layer': 32, 'n_head': 71, 'n_embd': 4544, 'rotary_percentage': 1.0, 'parallel_residual': True, 'bias': False, 'n_query_groups': 1, 'shared_attention_norm': True}
Time to instantiate model: 1.33 seconds.
Time to load the model weights: 16.37 seconds.
Traceback (most recent call last):
  File "/home/ubuntu/llm-repos/lit-parrot/generate/base.py", line 204, in <module>
    CLI(main)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/jsonargparse/cli.py", line 85, in CLI
    return _run_component(component, cfg_init)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/jsonargparse/cli.py", line 147, in _run_component
    return component(**cfg)
  File "/home/ubuntu/llm-repos/lit-parrot/generate/base.py", line 156, in main
    model = fabric.setup_module(model)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 265, in setup_module
    module = self._strategy.setup_module(module)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/lightning/fabric/strategies/ddp.py", line 121, in setup_module
    return DistributedDataParallel(module=module, device_ids=device_ids, **self._ddp_kwargs)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 805, in __init__
    self._ddp_init_helper(
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1095, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.44 GiB. GPU 1 has a total capacty of 22.05 GiB of which 7.74 GiB is free. Including non-PyTorch memory, this process has 14.31 GiB memory in use. Of the allocated memory 13.49 GiB is allocated by PyTorch, and 49.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Traceback (most recent call last):
  File "/home/ubuntu/llm-repos/lit-parrot/generate/base.py", line 204, in <module>
    CLI(main)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/jsonargparse/cli.py", line 85, in CLI
    return _run_component(component, cfg_init)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/jsonargparse/cli.py", line 147, in _run_component
    return component(**cfg)
  File "/home/ubuntu/llm-repos/lit-parrot/generate/base.py", line 156, in main
    model = fabric.setup_module(model)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/lightning/fabric/fabric.py", line 265, in setup_module
    module = self._strategy.setup_module(module)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/lightning/fabric/strategies/ddp.py", line 121, in setup_module
    return DistributedDataParallel(module=module, device_ids=device_ids, **self._ddp_kwargs)
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 805, in __init__
    self._ddp_init_helper(
  File "/home/ubuntu/miniconda3/envs/litpar/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1095, in _ddp_init_helper
    self.reducer = dist.Reducer(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.44 GiB. GPU 0 has a total capacty of 22.05 GiB of which 7.74 GiB is free. Including non-PyTorch memory, this process has 14.31 GiB memory in use. Of the allocated memory 13.49 GiB is allocated by PyTorch, and 49.67 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Thanks for reporting this! I can repro and see that devices=2 requires 39.80 GB.
I'll investigate :microscope:
Oh, I just noticed what the issue is. If you don't pass a --strategy, it will choose DDP, which keeps a full replica of the model on every GPU.
You should add --strategy fsdp when using more than one device. This is explained in https://github.com/Lightning-AI/lit-parrot/blob/main/howto/inference.md#run-a-large-model-on-multiple-smaller-devices. If you do that, it will only use 12 GB.
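For reference, the full command would then be the same as above with just the strategy flag added:

python generate/base.py --prompt "Hello, my name is" --checkpoint_dir checkpoints/tiiuae/falcon-7b-instruct --devices 2 --strategy fsdp

And as a minimal sketch of what that means in Fabric terms (this is not the repo's actual generate/base.py; the constructor arguments here are only illustrative):

from lightning.fabric import Fabric

# With the default "ddp" strategy, each of the 2 processes keeps a full replica
# of the ~14 GB model, and DDP's Reducer additionally allocates gradient buckets
# of roughly the same size (the 13.44 GiB allocation in the traceback), which
# does not fit on a 24 GB A10. With "fsdp", the parameters are sharded across
# the GPUs instead, so each process only holds part of the model.
fabric = Fabric(accelerator="cuda", devices=2, strategy="fsdp")
fabric.launch()
# model = ...                         # instantiate the model as generate/base.py does
# model = fabric.setup_module(model)  # wraps the model so its parameters are sharded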
Thanks @carmocca, with --strategy fsdp it is able to use both GPUs and runs successfully. However, I notice a significant degradation in tokens/sec for the multi-GPU run (2 devices in this case):
Time for inference 1: 31.87 sec total, 1.57 tokens/sec
Memory used: 12.45 GB
which is much slower than the single-device run:
Time for inference 1: 2.15 sec total, 23.30 tokens/sec
Memory used: 14.56 GB
Yes, the speed degradation is expected: you are trading throughput for lower memory requirements. FSDP shards the parameters across the GPUs, so each forward pass has to gather them over the interconnect before computing, which adds communication overhead that a single-device run does not have.