Martyna Patelka
When running:
```
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
  --model_name stablecode-completion-alpha-3b \
  --compile "thunder_inductor"
```
I still get an OOM error for tag pjnl-20240427.

**Memory usage:**
* Thunder: 77.03 GB
* Torch compile: Memory...
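For reference, a minimal sketch of how peak-memory numbers like the ones above can be read out (this is an assumption about the measurement, not the benchmark script's own reporting code):

```python
import torch

# Minimal sketch: read peak CUDA memory around a workload, similar to the
# numbers reported above. Not taken from benchmark_litgpt.py.
torch.cuda.reset_peak_memory_stats()
# ... run one forward/backward iteration of the model here ...
peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak allocated memory: {peak_gb:.2f} GB")
```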
Should I report it in the [TransformerEngine](https://github.com/NVIDIA/TransformerEngine) repo then?
I think the issue was resolved and it doesn't happen anymore.
We see this error in recent runs as well. I'm able to reproduce it on 8x NVIDIA H100 80GB HBM3. Maybe you could try adding the `--n_layers 1` flag to reduce...
Yes, the same error is present when running: `python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name Gemma-7b --compile thunder_cudnn --low_precision_mode fp8-delayed-te --micro_batch_size 1 --n_layers 1`
FYI: I was curious whether the checkpoint-saving code is definitely correct in eager mode, so I ran it on each rank and then compared the shapes of...
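This is roughly the kind of per-rank shape comparison I mean (a hypothetical helper, not code from the repo; it assumes `torch.distributed` is already initialized):

```python
import torch
import torch.distributed as dist

def compare_state_dict_shapes(model: torch.nn.Module) -> None:
    """Gather every rank's state-dict shapes and report mismatches on rank 0."""
    shapes = {name: tuple(t.shape) for name, t in model.state_dict().items()}
    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, shapes)
    if dist.get_rank() == 0:
        for name in gathered[0]:
            per_rank = [rank_shapes.get(name) for rank_shapes in gathered]
            if len(set(per_rank)) > 1:
                print(f"{name}: shapes differ across ranks: {per_rank}")
```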
A small update after a discussion with @carmocca about saving checkpoints from Thunder FSDP: I tried using the `save` and `get_model_state_dict` functions provided by Thunder and then converting the checkpoint into torch save...
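One way that "convert into torch save" step could be done is with PyTorch's own utility (a sketch assuming the Thunder `save` wrote a checkpoint in PyTorch's DCP format; both paths are placeholders):

```python
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

# Sketch: convert a distributed (DCP-format) checkpoint directory into a single
# torch.save file. Assumes Thunder's `save` already wrote `checkpoints/dcp_dir`.
dcp_to_torch_save("checkpoints/dcp_dir", "checkpoints/model.pt")
```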
Hi! Is there any update on this? From the Slack discussion, my understanding is that there were 3 options for me to make progress:
1. Save a distributed checkpoint in Thunder, convert it...
The same issue (a lower achievable batch size) is also present for the pythia-12b and Nous-Hermes-13b models on 1x8 H100 and 2x8 H100. For Inductor we can use micro batch size 4,...
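For reference, a repro would follow the same pattern as the commands above (a sketch: `--compile thunder` and the distributed launch arguments for the 1x8/2x8 setups are assumptions and would need to match your environment):

```
python /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py \
  --model_name pythia-12b \
  --compile thunder \
  --micro_batch_size 4
```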
I noticed that a similar problem is present for the `falcon-40b`, `Platypus-30B` and `vicuna-33b-v1.3` models.