AssertionError: t54580_out for rematerialisation for Gemma-2-27b and other models.

Open mpatel31415 opened this issue 1 year ago • 0 comments

🐛 Bug

When running the benchmarking script with --checkpoint_activations True, we get:

AssertionError: t54580_out for rematerialisation

This issue is present for the following models: 'Llama-3-70B', 'Gemma-2-27b', 'longchat-13b-16k', 'Mistral-7B-v0.2', 'vicuna-7b-v1.5-16k', 'Llama-2-13b-hf', 'CodeLlama-34b-hf'

To Reproduce

Please use: 1 node(s), each with 8 GPUs.
Image: "INTERNAL_IMAGE:pjnl-20240930"
Training script:
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
    --model_name Gemma-2-27b \
    --distributed_mode fsdp \
    --shard_mode zero2 \
    --compile thunder \
    --checkpoint_activations True \
    --low_precision_mode none \
    --micro_batch_size 1
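
To check the other affected models, the same command can be generated per model. The following is only a sketch: it assumes the flags shown above for Gemma-2-27b apply unchanged to every model in the list, which this report does not confirm, and `build_command` is a hypothetical helper name, not part of the benchmark script.

```python
import shlex

# Path and flags are copied from the reproduction command in this report.
SCRIPT = "/opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py"

# Models listed in this report as hitting the AssertionError.
AFFECTED_MODELS = [
    "Llama-3-70B",
    "Gemma-2-27b",
    "longchat-13b-16k",
    "Mistral-7B-v0.2",
    "vicuna-7b-v1.5-16k",
    "Llama-2-13b-hf",
    "CodeLlama-34b-hf",
]


def build_command(model_name):
    """Return the reproduction command as an argument list (hypothetical helper).

    Assumption: the same flags are valid for every affected model.
    """
    return [
        "python", SCRIPT,
        "--model_name", model_name,
        "--distributed_mode", "fsdp",
        "--shard_mode", "zero2",
        "--compile", "thunder",
        "--checkpoint_activations", "True",
        "--low_precision_mode", "none",
        "--micro_batch_size", "1",
    ]


if __name__ == "__main__":
    # Print one shell-quoted command per affected model; nothing is executed.
    for model in AFFECTED_MODELS:
        print(shlex.join(build_command(model)))
```

The script only prints the commands, so they can be reviewed or pasted into whatever launcher the 8-GPU node uses.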

Expected behavior

We should be able to run the training.

Environment

system.device_product_name DGXH100
system.gpu_driver_version 535.129.03
libraries.cuda 12.6.2.004
libraries.pip.lightning 2.4.0.dev20240728
libraries.pip.lightning-thunder 0.2.0.dev0
libraries.pip.lightning-utilities 0.11.7
libraries.pip.litgpt 0.4.11
libraries.pip.nvfuser 0.2.13+git2cee59d
libraries.pip.pytorch-lightning 2.4.0
libraries.pip.torch 2.6.0a0+gitc4ae451
libraries.pip.torchmetrics 1.4.2
libraries.pip.torchvision 0.19.0a0+d23a6e1

mpatel31415, Oct 01 '24 09:10