
Different shapes and values of model weights and losses between FSDP training in Eager mode and with Thunder

mpatel31415 opened this issue 1 year ago • 7 comments

🐛 Bug

After training Llama-3-8B on 8 A100s for 10 iterations in eager mode, I printed the model weights:

# Added after the training loop in thunder/benchmarks/benchmark_litgpt.py;
# torch_dist, benchmark, and global_rank are defined earlier in that script.
torch_dist.barrier()
weights_after_training = benchmark.model.lm_head.weight[:10].data.to(device="cpu", dtype=torch.float32).numpy()
if global_rank in [0, None]:
    print(f"WEIGHTS:\n{weights_after_training.shape}\n{weights_after_training}")

when not using Thunder I got:

WEIGHTS: (10,) [ 0.01855469 0.00598145 0.01312256 0.01300049 0.00262451 0.0055542 -0.01104736 0.00076294 0.01202393 -0.00909424]

when using Thunder I got:

WEIGHTS: (10, 4096) [[ 0.01281738 0.00582886 0.01342773 ... -0.01196289 -0.00369263 -0.01287842] [-0.00331116 0.01647949 -0.01452637 ... -0.01696777 -0.00650024 -0.00145721] [ 0.01867676 0.00334167 0.00133514 ... -0.00531006 -0.00744629 0.01147461] ... [-0.01019287 -0.00939941 0.00204468 ... 0.01184082 0.00201416 -0.01104736] [-0.00643921 0.00318909 0.01623535 ... -0.00148773 0.01153564 -0.01086426] [-0.00921631 -0.01452637 0.01586914 ... -0.01330566 0.00445557 0.00692749]]

So both the shape and the values are different. I checked that executing the training script multiple times gives consistent results, so it's not a problem with randomness.
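For intuition, here is a minimal single-process sketch, with tiny illustrative dimensions rather than the real Llama-3-8B sizes, of one way the two printouts can end up with different shapes (my reading of the output, not confirmed behavior): if eager-mode FSDP exposes the parameters as a flattened 1-D storage, slicing ten elements gives shape (10,), while a weight sharded along dim 0 stays 2-D:

import torch

# Tiny stand-in for lm_head.weight (vocab x hidden); the real shape is (128256, 4096).
full_weight = torch.randn(128, 16)

# A flattened 1-D view, sliced to ten elements:
flat_view = full_weight.reshape(-1)
print(flat_view[:10].shape)   # torch.Size([10])      -> like the Eager printout

# A dim-0 shard (1 of 8 ranks) keeps the 2-D layout:
row_shard = full_weight[: full_weight.shape[0] // 8]
print(row_shard[:10].shape)   # torch.Size([10, 16])  -> 2-D, like the Thunder printout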

To Reproduce

  1. Start the container by running:
docker run --pull=always --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it INTERNAL_IMAGE:pjnl-20240724
  2. Inside the container, create the training script from the file attached to this issue, or add the four lines of code from the bug description at line 584 of lightning-thunder/thunder/benchmarks/benchmark_litgpt.py.

  3. Assuming the newly created script is called benchmark_litgpt.py, run:

  • For Eager
torchrun --standalone --max-restarts=0 --nproc-per-node=8  benchmark_litgpt.py  --model_name Llama-3-8B --max_iters 10 --warmup_iters 2 --distributed_mode fsdp --shard_mode zero3 --bucketing_mode block &> file_eager_1.txt
  • For Thunder
torchrun --standalone --max-restarts=0 --nproc-per-node=8  benchmark_litgpt.py  --model_name Llama-3-8B --max_iters 10 --warmup_iters 2 --distributed_mode fsdp --shard_mode zero3 --bucketing_mode block --compile thunder &> file_thunder_1.txt

The results will be written to file_eager_1.txt and file_thunder_1.txt, respectively.
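To compare the two runs side by side, here is a small hypothetical helper (the filenames match the redirections above; the regex assumes the "loss X.XXXX" format of the iteration logs shown later in this thread):

import re

# Hypothetical helper: extract the loss values from both log files and
# print them next to each other with their absolute difference.
def losses(path):
    with open(path) as f:
        return [float(m.group(1)) for m in re.finditer(r"loss (\d+\.\d+)", f.read())]

eager_losses = losses("file_eager_1.txt")
thunder_losses = losses("file_thunder_1.txt")
for i, (e, t) in enumerate(zip(eager_losses, thunder_losses)):
    print(f"iter {i}: eager {e:.4f}  thunder {t:.4f}  diff {abs(e - t):.4f}")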

Expected behavior

The shapes and values of the model weights should be the same in Eager mode and with Thunder.

Environment

Output from nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:07:00.0 Off |                    0 |
| N/A   30C    P0             60W /  400W |       3MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  |   00000000:0F:00.0 Off |                    0 |
| N/A   29C    P0             58W /  400W |       3MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  |   00000000:47:00.0 Off |                    0 |
| N/A   29C    P0             58W /  400W |       3MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  |   00000000:4E:00.0 Off |                    0 |
| N/A   31C    P0             62W /  400W |    2757MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA A100-SXM4-80GB          On  |   00000000:87:00.0 Off |                    0 |
| N/A   33C    P0             59W /  400W |       3MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA A100-SXM4-80GB          On  |   00000000:90:00.0 Off |                    0 |
| N/A   33C    P0             62W /  400W |       3MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA A100-SXM4-80GB          On  |   00000000:B7:00.0 Off |                    0 |
| N/A   33C    P0             62W /  400W |       3MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA A100-SXM4-80GB          On  |   00000000:BD:00.0 Off |                    0 |
| N/A   33C    P0             60W /  400W |       3MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

Python packages:

lightning 2.3.3
lightning-thunder 0.2.0.dev0
lightning-utilities 0.11.6
litgpt 0.4.5
nvfuser 0.2.8+gitfa2bedc
nvidia-cudnn-frontend 1.5.2
nvidia-pyprof 3.11.0
pytorch-lightning 2.3.3
torch 2.5.0a0+git16a2a1a
torchmetrics 1.4.0.post0
torchvision 0.19.0a0+d23a6e1

Additional context

(This is a .py file, but to attach it here I had to change the extension.) benchmark_litgpt.txt

cc @carmocca @crcrpar

mpatel31415 avatar Jul 25 '24 13:07 mpatel31415

This is expected, as we leave the provided tensors alone (though we will change the recommended init scheme). Please use benchmark.model.get_parameter('lm_head.weight')[:10] to get the sharded weights.
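As a minimal sketch of the two access paths (a toy module compiled with thunder.jit; there is no sharding here, so both shapes agree in this toy case, but under fsdp get_parameter returns the sharded tensor the compiled module actually uses):

import torch
import thunder

# Toy module, single process, no distributed sharding.
model = torch.nn.Linear(8, 4, bias=False)
tmodel = thunder.jit(model)

print(model.weight[:2].shape)                    # attribute on the original module
print(tmodel.get_parameter('weight')[:2].shape)  # parameter as seen by the ThunderModule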

t-vi avatar Jul 25 '24 20:07 t-vi

I tested it and benchmark.model.get_parameter('lm_head.weight')[:10] still gives shape [10, 4096] for Thunder and [10] for Eager. Also, is it expected that the values of the parameters are different between Thunder and Eager?

mpatel31415 avatar Jul 26 '24 08:07 mpatel31415

Is model the original model or the thunder module?

t-vi avatar Jul 26 '24 09:07 t-vi

In the case of Thunder it's the Thunder module: thunder.core.module.ThunderModule. In the case of Eager it's the original module.
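For completeness, a quick self-contained way to check which type one is holding (module path taken from the message above):

import torch
import thunder
from thunder.core.module import ThunderModule

model = torch.nn.Linear(4, 4)
tmodel = thunder.jit(model)
print(isinstance(tmodel, ThunderModule))   # True
print(isinstance(model, ThunderModule))    # False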

mpatel31415 avatar Jul 26 '24 09:07 mpatel31415

The value of the loss is also different between Thunder and Eager.

Eager:

iter 0: loss 11.9375, iter time: 6618.87ms, t: 8192
iter 1: loss 9.8750, iter time: 1466.43ms, t: 8192
iter 2: loss 5.9375, iter time: 1097.02ms, t: 8192
iter 3: loss 4.8125, iter time: 1096.80ms, t: 8192
iter 4: loss 4.6875, iter time: 1093.69ms, t: 8192
iter 5: loss 4.6875, iter time: 1098.55ms, t: 8192
iter 6: loss 4.6562, iter time: 1096.84ms, t: 8192
iter 7: loss 4.6250, iter time: 1098.75ms, t: 8192
iter 8: loss 4.6562, iter time: 1186.29ms, t: 8192
iter 9: loss 4.6562, iter time: 1106.99ms, t: 8192

Thunder:

iter 0: loss 11.8750, iter time: 73451.35ms, t: 8192
iter 1: loss 9.5000, iter time: 993.84ms, t: 8192
iter 2: loss 5.7188, iter time: 1012.46ms, t: 8192
iter 3: loss 4.8125, iter time: 1013.92ms, t: 8192
iter 4: loss 4.6875, iter time: 1003.81ms, t: 8192
iter 5: loss 4.7188, iter time: 1015.76ms, t: 8192
iter 6: loss 4.6875, iter time: 1002.12ms, t: 8192
iter 7: loss 4.6562, iter time: 999.51ms, t: 8192
iter 8: loss 4.6562, iter time: 1006.47ms, t: 8192
iter 9: loss 4.6562, iter time: 1012.25ms, t: 8192

When I trained on some real data the difference was much larger.

Eager:

12:30:08 | Iteration 0: loss 11.9375, time: 5652.83ms
12:30:09 | Iteration 1: loss 11.5000, time: 1446.79ms
12:30:10 | Iteration 2: loss 10.8125, time: 1078.02ms
12:30:11 | Iteration 3: loss 9.0625, time: 1082.69ms
12:30:12 | Iteration 4: loss 8.7500, time: 1084.27ms
12:30:13 | Iteration 5: loss 8.3125, time: 1083.96ms
12:30:14 | Iteration 6: loss 8.3750, time: 1080.23ms
12:30:15 | Iteration 7: loss 8.0625, time: 1081.94ms
12:30:17 | Iteration 8: loss 8.8750, time: 1079.18ms
12:30:18 | Iteration 9: loss 8.3750, time: 1081.14ms
12:30:19 | Iteration 10: loss 7.8438, time: 1084.09ms

Thunder:

12:43:51 | Iteration 0: loss 1.0312, time: 74820.20ms
12:43:52 | Iteration 1: loss 13.8750, time: 910.84ms
12:43:52 | Iteration 2: loss 27.2500, time: 920.43ms
12:43:53 | Iteration 3: loss 10.8125, time: 926.16ms
12:43:54 | Iteration 4: loss 9.3125, time: 929.09ms
12:43:55 | Iteration 5: loss 8.2500, time: 921.80ms
12:43:56 | Iteration 6: loss 7.7500, time: 920.07ms
12:43:57 | Iteration 7: loss 7.6562, time: 921.99ms
12:43:58 | Iteration 8: loss 8.3750, time: 923.90ms
12:43:59 | Iteration 9: loss 8.1250, time: 921.77ms
12:44:00 | Iteration 10: loss 7.7188, time: 928.60ms

When I change the seed for Eager mode I get different initial loss values, but they still oscillate around 10, and the variability is not as large as for Thunder (loss = 1 or 27).
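As a sanity check on the iteration-0 values above (my assumption: an untrained model predicting roughly uniformly over Llama 3's 128256-token vocabulary should start near a cross-entropy of ln(vocab_size)):

import math

# Expected initial cross-entropy for near-uniform predictions over the vocabulary
# (128256 tokens is an assumption based on the Llama 3 tokenizer).
print(math.log(128256))   # ~11.76: close to Eager's 11.9375, far from Thunder's 1.0312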

mpatel31415 avatar Jul 29 '24 13:07 mpatel31415

fyi @IvanYashchuk

mruberry avatar Jul 29 '24 15:07 mruberry

I will look into this.

IvanYashchuk avatar Jul 30 '24 09:07 IvanYashchuk