RuntimeError and Socket Connection Failure when Benchmarking Gemma-7b with Micro Batch Size 1
🐛 Bug
When using benchmark_litgpt, the following error occurs:
Exception: Unexpected error occurred for {'--micro_batch_size': 1} due to [W803 20:39:12.165884696 socket.cpp:752] [c10d] The client socket has failed to connect to [eos0157.eos.clusters.nvidia.com]:59504 (errno: 22 - Invalid argument).
[rank4]: RuntimeError: Devices were expected to be the same, but got devices thunder.devices.Device(type='cuda', index=4) and thunder.devices.Device(type='cpu')!
To Reproduce
Start interactive job on a cluster:
srun -A YOUR_SLURM_ACCOUNT -J YOUR_SLURM_ACCOUNT-thunder.lit-gpt -N1 -p batch --container-image=INTERNAL_IMAGE:pjnl-20240801 --pty bash
Then execute:
torchrun --standalone --max-restarts=0 --no-python --nproc-per-node=8 python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
--model_name Gemma-7b \
--distributed_mode fsdp \
--shard_mode zero2 \
--compile thunder_cudnn \
--checkpoint_activations False \
--low_precision_mode fp8-delayed-te \
--save_logs_for_all_batches True \
--micro_batch_size 1
Expected Behavior
When running the benchmark for the Gemma-7b model with the specified configuration, the benchmarking script should execute successfully without any errors.
Environment
system.device_product_name DGXH100
system.gpu_driver_version 535.129.03
libraries.cuda 12.6.0.021
libraries.pip.lightning 2.4.0.dev20240728
libraries.pip.lightning-thunder 0.2.0.dev0
libraries.pip.lightning-utilities 0.11.6
libraries.pip.litgpt 0.4.7
libraries.pip.nvfuser 0.2.8+git671171f
libraries.pip.pytorch-lightning 2.3.3
libraries.pip.torch 2.5.0a0+gita94e507
libraries.pip.torchmetrics 1.4.0.post0
libraries.pip.torchvision 0.19.0a0+d23a6e1
Additional context
Comment: this looks like a Thunder bug; something is ending up on the CPU even though it shouldn't.
I couldn't reproduce this error on an H100 80GB; instead I got an OOM (I removed --save_logs_for_all_batches True).
container: pjnl-20240801
lightning-thunder 0.2.0.dev0 /opt/pytorch/lightning-thunder
nvfuser 0.2.8+git671171f /opt/pytorch/nvfuser
root@803c226ee238:/opt/pytorch/lightning-thunder# torchrun --standalone --max-restarts=0 --no-python --nproc-per-node=8 python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name Gemma-7b --distributed_mode fsdp --shard_mode zero2 --compile thunder_cudnn --checkpoint_activations False --low_precision_mode fp8-delayed-te --micro_batch_size 1
W0808 13:28:05.563000 931 torch/distributed/run.py:793]
W0808 13:28:05.563000 931 torch/distributed/run.py:793] *****************************************
W0808 13:28:05.563000 931 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0808 13:28:05.563000 931 torch/distributed/run.py:793] *****************************************
Loading model with {'name': 'Gemma-7b', 'hf_config': {'org': 'google', 'name': 'gemma-7b'}, 'scale_embeddings': True, 'block_size': 4096, 'vocab_size': 256000, 'padding_multiple': 64, 'padded_vocab_size': 256000, 'n_layer': 28, 'n_head': 16, 'head_size': 256, 'n_embd': 3072, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'lm_head_bias': False, 'n_query_groups': 16, 'shared_attention_norm': False, 'norm_class_name': 'RMSNorm', 'norm_eps': 1e-05, 'mlp_class_name': 'GemmaMLP', 'gelu_approximate': 'tanh', 'intermediate_size': 24576, 'rope_condense_ratio': 1, 'rope_base': 10000, 'n_expert': 0, 'n_expert_per_token': 0, 'rope_n_elem': 256}
Time to instantiate model: 0.02 seconds.
...
[rank6]: Traceback (most recent call last):
[rank6]: File "/opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 628, in <module>
[rank6]: CLI(benchmark_main)
[rank6]: File "/usr/local/lib/python3.10/dist-packages/jsonargparse/_cli.py", line 96, in CLI
[rank6]: return _run_component(components, init)
[rank6]: File "/usr/local/lib/python3.10/dist-packages/jsonargparse/_cli.py", line 204, in _run_component
[rank6]: return component(**cfg)
[rank6]: File "/opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 583, in benchmark_main
[rank6]: benchmark.train()
[rank6]: File "/opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 495, in train
[rank6]: loss.backward()
[rank6]: File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 522, in backward
[rank6]: torch.autograd.backward(
[rank6]: File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 346, in backward
[rank6]: _engine_run_backward(
[rank6]: File "/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py", line 812, in _engine_run_backward
[rank6]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank6]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.95 GiB. GPU 6 has a total capacity of 79.10 GiB of which 1.03 GiB is free. Process 2592558 has 78.06 GiB memory in use. Of the allocated memory 75.47 GiB is allocated by PyTorch, and 674.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
...
Thank you for the answer. How many GPUs did you use? AFAIK this was executed on a single node with 8 GPUs (H100).
Yes, I used 1 node with 8 H100(80G). Has anyone else tried it to see if it's reproducible?
We see this error in recent runs as well. I'm able to reproduce it on 8x NVIDIA H100 80GB HBM3. Maybe you could try adding the --n_layers 1 flag to reduce memory usage?
Here is the command I used with the image from 20240814:
torchrun --standalone --max-restarts=0 --no-python --nproc-per-node=8 python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name Gemma-7b --distributed_mode fsdp --shard_mode zero2 --compile thunder_cudnn --checkpoint_activations False --low_precision_mode fp8-delayed-te --micro_batch_size 1 --n_layers 1
With --n_layers 1, can this be reproduced on 1 GPU? It's probably easier to find a single H100 than a whole node with 8.
Yes, the same error is present when running:
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name Gemma-7b --compile thunder_cudnn --low_precision_mode fp8-delayed-te --micro_batch_size 1 --n_layers 1
@kiya00, could you please take a look at this problem and tell us what's needed for a fix?
Per the triage meeting on 8/26, moved to @t-vi and assigned priority P2.
It can be reproduced in container pjnl-20240830-mixology_70d843cd but not pjnl-20240830
A minimal reproducer is:
import torch, thunder

def fun(x):
    # The 0-dim constant tensor is created on the CPU while x lives on the GPU.
    x = x * torch.tensor(0.5, dtype=x.dtype)
    return x

x = torch.randn((2, 2), dtype=torch.bfloat16).cuda()
# print(fun(x))  # eager PyTorch handles this fine
jfun = thunder.jit(fun)
jfun(x)  # raises: Devices were expected to be the same, but got 'cuda' and 'cpu'
PyTorch can multiply a CUDA tensor by a CPU scalar tensor, but Thunder can't.
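For reference, a minimal eager-mode sketch (no Thunder involved) of the 0-dim CPU tensor promotion PyTorch performs; the variable names are only illustrative:

import torch

# Eager PyTorch moves the 0-dim CPU tensor to the CUDA device of the other operand,
# which is why the un-jitted function above works.
a = torch.randn((2, 2), dtype=torch.bfloat16, device="cuda")
b = torch.tensor(0.5, dtype=torch.bfloat16)  # 0-dim tensor on the CPU
print((a * b).device)  # cuda:0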
The problem can also be fixed by modifying the LitGPT code here https://github.com/Lightning-AI/litgpt/blob/1d37f9a99bb4ba2b7373bc7fc5b8c5a457af48df/litgpt/model.py#L95
- x = x * torch.tensor(self.config.n_embd**0.5, dtype=x.dtype)
+ x = x * torch.tensor(self.config.n_embd**0.5, dtype=x.dtype, device=x.device)
Two months ago this line used a Python scalar, which is why it worked:
x = x * (self.config.n_embd**0.5)
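For comparison, a minimal sketch of the scalar variant under thunder.jit (fun_scalar is just an illustrative name), which never materializes a CPU tensor and therefore does not hit the device mismatch:

import torch, thunder

def fun_scalar(x):
    # Multiplying by a Python float keeps the computation on x's device,
    # so Thunder sees no cross-device operands.
    return x * 0.5

x = torch.randn((2, 2), dtype=torch.bfloat16).cuda()
jfun = thunder.jit(fun_scalar)
print(jfun(x).device)  # cuda:0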