RuntimeError and Socket Connection Failure when Benchmarking Gemma-7b with Micro Batch Size 1
🐛 Bug
When using benchmark_litgpt, the following error occurs:
Exception: Unexpected error occurred for {'--micro_batch_size': 1} due to [W803 20:39:12.165884696 socket.cpp:752] [c10d] The client socket has failed to connect to [eos0157.eos.clusters.nvidia.com]:59504 (errno: 22 - Invalid argument).
[rank4]: RuntimeError: Devices were expected to be the same, but got devices thunder.devices.Device(type='cuda', index=4) and thunder.devices.Device(type='cpu')!
To Reproduce
Start interactive job on a cluster:
srun -A YOUR_SLURM_ACCOUNT -J YOUR_SLURM_ACCOUNT-thunder.lit-gpt -N1 -p batch --container-image=INTERNAL_IMAGE:pjnl-20240801 --pty bash
Then execute:
torchrun --standalone --max-restarts=0 --no-python --nproc-per-node=8 python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
--model_name Gemma-7b \
--distributed_mode fsdp \
--shard_mode zero2 \
--compile thunder_cudnn \
--checkpoint_activations False \
--low_precision_mode fp8-delayed-te \
--save_logs_for_all_batches True \
--micro_batch_size 1
Expected Behavior
When running the benchmark for the Gemma-7b model with the specified configuration, the benchmarking script should execute successfully without any errors.
Environment
system.device_product_name DGXH100
system.gpu_driver_version 535.129.03
libraries.cuda 12.6.0.021
libraries.pip.lightning 2.4.0.dev20240728
libraries.pip.lightning-thunder 0.2.0.dev0
libraries.pip.lightning-utilities 0.11.6
libraries.pip.litgpt 0.4.7
libraries.pip.nvfuser 0.2.8+git671171f
libraries.pip.pytorch-lightning 2.3.3
libraries.pip.torch 2.5.0a0+gita94e507
libraries.pip.torchmetrics 1.4.0.post0
libraries.pip.torchvision 0.19.0a0+d23a6e1
Additional context
Comment: this looks like a Thunder bug; something is ending up on the CPU even though it shouldn't.
I couldn't reproduce this error on an H100 80GB; instead I got an OOM (I removed --save_logs_for_all_batches True).
container: pjnl-20240801
lightning-thunder 0.2.0.dev0 /opt/pytorch/lightning-thunder
nvfuser 0.2.8+git671171f /opt/pytorch/nvfuser
root@803c226ee238:/opt/pytorch/lightning-thunder# torchrun --standalone --max-restarts=0 --no-python --nproc-per-node=8 python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name Gemma-7b --distributed_mode fsdp --shard_mode zero2 --compile thunder_cudnn --checkpoint_activations False --low_precision_mode fp8-delayed-te --micro_batch_size 1
W0808 13:28:05.563000 931 torch/distributed/run.py:793]
W0808 13:28:05.563000 931 torch/distributed/run.py:793] *****************************************
W0808 13:28:05.563000 931 torch/distributed/run.py:793] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0808 13:28:05.563000 931 torch/distributed/run.py:793] *****************************************
Loading model with {'name': 'Gemma-7b', 'hf_config': {'org': 'google', 'name': 'gemma-7b'}, 'scale_embeddings': True, 'block_size': 4096, 'vocab_size': 256000, 'padding_multiple': 64, 'padded_vocab_size': 256000, 'n_layer': 28, 'n_head': 16, 'head_size': 256, 'n_embd': 3072, 'rotary_percentage': 1.0, 'parallel_residual': False, 'bias': False, 'lm_head_bias': False, 'n_query_groups': 16, 'shared_attention_norm': False, 'norm_class_name': 'RMSNorm', 'norm_eps': 1e-05, 'mlp_class_name': 'GemmaMLP', 'gelu_approximate': 'tanh', 'intermediate_size': 24576, 'rope_condense_ratio': 1, 'rope_base': 10000, 'n_expert': 0, 'n_expert_per_token': 0, 'rope_n_elem': 256}
Time to instantiate model: 0.02 seconds.
...
[rank6]: Traceback (most recent call last):
[rank6]: File "/opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 628, in <module>
[rank6]: CLI(benchmark_main)
[rank6]: File "/usr/local/lib/python3.10/dist-packages/jsonargparse/_cli.py", line 96, in CLI
[rank6]: return _run_component(components, init)
[rank6]: File "/usr/local/lib/python3.10/dist-packages/jsonargparse/_cli.py", line 204, in _run_component
[rank6]: return component(**cfg)
[rank6]: File "/opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 583, in benchmark_main
[rank6]: benchmark.train()
[rank6]: File "/opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py", line 495, in train
[rank6]: loss.backward()
[rank6]: File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 522, in backward
[rank6]: torch.autograd.backward(
[rank6]: File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 346, in backward
[rank6]: _engine_run_backward(
[rank6]: File "/usr/local/lib/python3.10/dist-packages/torch/autograd/graph.py", line 812, in _engine_run_backward
[rank6]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank6]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.95 GiB. GPU 6 has a total capacity of 79.10 GiB of which 1.03 GiB is free. Process 2592558 has 78.06 GiB memory in use. Of the allocated memory 75.47 GiB is allocated by PyTorch, and 674.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
...
Thank you for the answer. How many GPUs did you use? AFAIK this was executed on a single node with 8 GPUs (H100).
Yes, I used 1 node with 8 H100(80G). Has anyone else tried it to see if it's reproducible?
We see this error in recent runs as well. I'm able to reproduce it on 8x NVIDIA H100 80GB HBM3. Maybe you could try adding the --n_layers 1 flag to reduce memory usage?
Here is the command I used with the image from 20240814:
torchrun --standalone --max-restarts=0 --no-python --nproc-per-node=8 python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name Gemma-7b --distributed_mode fsdp --shard_mode zero2 --compile thunder_cudnn --checkpoint_activations False --low_precision_mode fp8-delayed-te --micro_batch_size 1 --n_layers 1
With --n_layers 1, can this be reproduced on 1 GPU? It's probably easier to find a single H100 than a whole node with 8.
Yes, the same error is present when running:
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name Gemma-7b --compile thunder_cudnn --low_precision_mode fp8-delayed-te --micro_batch_size 1 --n_layers 1
@kiya00, could you please take a look at this problem and tell us what's needed for a fix?
Per the triage meeting on 8/26, moved to @t-vi and assigned priority P2.
It can be reproduced in container pjnl-20240830-mixology_70d843cd but not pjnl-20240830
A minimal reproducer is:
import torch, thunder

def fun(x):
    # The 0-dim constant tensor is created on the CPU while x lives on the GPU.
    x = x * torch.tensor(0.5, dtype=x.dtype)
    return x

x = torch.randn((2, 2), dtype=torch.bfloat16).cuda()
# print(fun(x))  # eager PyTorch handles this fine
jfun = thunder.jit(fun)
jfun(x)  # raises: Devices were expected to be the same, but got 'cuda' and 'cpu'
PyTorch can multiply a CUDA tensor by a CPU scalar tensor, but Thunder can't.
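For reference, a minimal eager-mode sketch (no Thunder involved) of the 0-dim CPU tensor promotion PyTorch performs; the variable names are only illustrative:

import torch

# Eager PyTorch moves the 0-dim CPU tensor to the CUDA device of the other operand,
# which is why the un-jitted function above works.
a = torch.randn((2, 2), dtype=torch.bfloat16, device="cuda")
b = torch.tensor(0.5, dtype=torch.bfloat16)  # 0-dim tensor on the CPU
print((a * b).device)  # cuda:0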
The problem can also be fixed by modifying the LitGPT code here https://github.com/Lightning-AI/litgpt/blob/1d37f9a99bb4ba2b7373bc7fc5b8c5a457af48df/litgpt/model.py#L95
- x = x * torch.tensor(self.config.n_embd**0.5, dtype=x.dtype)
+ x = x * torch.tensor(self.config.n_embd**0.5, dtype=x.dtype, device=x.device)
Two months ago this line used a Python scalar, which is why it worked:
x = x * (self.config.n_embd**0.5)
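For comparison, a minimal sketch of the scalar variant under thunder.jit (fun_scalar is just an illustrative name), which never materializes a CPU tensor and therefore does not hit the device mismatch:

import torch, thunder

def fun_scalar(x):
    # Multiplying by a Python float keeps the computation on x's device,
    # so Thunder sees no cross-device operands.
    return x * 0.5

x = torch.randn((2, 2), dtype=torch.bfloat16).cuda()
jfun = thunder.jit(fun_scalar)
print(jfun(x).device)  # cuda:0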