Martyna Patelka
When running:
```
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
  --model_name stablecode-completion-alpha-3b \
  --compile "thunder_inductor"
```
I still get an OOM error for tag pjnl-20240427.

**Memory usage:**
* Thunder: 77.03 GB
* Torch compile: Memory...
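For reference, a minimal sketch of how peak-memory numbers like the ones above can be read out (this is an assumption about the measurement, not the benchmark script's own reporting code):

```python
import torch

# Minimal sketch: read peak CUDA memory around a workload, similar to the
# numbers reported above. Not taken from benchmark_litgpt.py.
torch.cuda.reset_peak_memory_stats()
# ... run one forward/backward iteration of the model here ...
peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak allocated memory: {peak_gb:.2f} GB")
```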
Should I report it in the [TransformerEngine](https://github.com/NVIDIA/TransformerEngine) repo then?
I think the issue was resolved and it doesn't happen anymore.
We see this error in recent runs as well. I'm able to reproduce it on 8x NVIDIA H100 80GB HBM3. Maybe you could try adding the `--n_layers 1` flag to reduce...
Yes, the same error is present when running: `python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name Gemma-7b --compile thunder_cudnn --low_precision_mode fp8-delayed-te --micro_batch_size 1 --n_layers 1`
FYI: I was curious whether the checkpoint-saving code is definitely correct in eager mode, so I ran it on each rank and then compared the shapes of...
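This is roughly the kind of per-rank shape comparison I mean (a hypothetical helper, not code from the repo; it assumes `torch.distributed` is already initialized):

```python
import torch
import torch.distributed as dist

def compare_state_dict_shapes(model: torch.nn.Module) -> None:
    """Gather every rank's state-dict shapes and report mismatches on rank 0."""
    shapes = {name: tuple(t.shape) for name, t in model.state_dict().items()}
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, shapes)
    if dist.get_rank() == 0:
        for name in gathered[0]:
            per_rank = [rank_shapes.get(name) for rank_shapes in gathered]
            if len(set(per_rank)) > 1:
                print(f"{name}: shapes differ across ranks: {per_rank}")
```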
A small update after a discussion with @carmocca about saving checkpoints from Thunder FSDP: I tried using the `save` and `get_model_state_dict` functions provided by Thunder and then converting the checkpoint into torch save...
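One way that "convert into torch save" step could be done is with PyTorch's own utility (a sketch assuming the Thunder `save` wrote a checkpoint in PyTorch's DCP format; both paths are placeholders):

```python
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

# Sketch: convert a distributed (DCP-format) checkpoint directory into a single
# torch.save file. Assumes Thunder's `save` already wrote `checkpoints/dcp_dir`.
dcp_to_torch_save("checkpoints/dcp_dir", "checkpoints/model.pt")
```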
Hi! Is there any update on this? From the Slack discussion, my understanding is that there were 3 options for me to make progress:
1. Save a distributed checkpoint in Thunder, convert it...
The same issue (a lower achievable batch size) is also present for the pythia-12b and Nous-Hermes-13b models on 1x8 H100 and 2x8 H100. For Inductor we can use micro batch size 4,...
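For reference, a repro would follow the same pattern as the commands above (a sketch: `--compile thunder` and the distributed launch arguments for the 1x8/2x8 setups are assumptions and would need to match your environment):

```
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
  --model_name pythia-12b \
  --compile thunder \
  --micro_batch_size 4
```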
I noticed that a similar problem is present for the `falcon-40b`, `Platypus-30B` and `vicuna-33b-v1.3` models.