Martyna Patelka
I tested it, and `benchmark.model.get_parameter('lm_head.weight')[:10]` still gives shape [10, 4096] for Thunder and [10] for Eager. Also, is it expected that the values of the parameters are different between Thunder and...
In the case of Thunder it's a Thunder module (`thunder.core.module.ThunderModule`); in the case of Eager it's the original module.
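For reference, a minimal sketch of how those two observations can be checked outside the benchmark; the `TinyLM` module and its sizes are hypothetical stand-ins for `benchmark.model`, and there is no FSDP here:

```
import torch
import thunder


class TinyLM(torch.nn.Module):
    """Hypothetical stand-in for the benchmark model; only lm_head matters here."""

    def __init__(self):
        super().__init__()
        self.lm_head = torch.nn.Linear(4096, 512, bias=False)

    def forward(self, x):
        return self.lm_head(x)


eager_model = TinyLM()
thunder_model = thunder.jit(eager_model)

# Module types: Thunder wraps the model, Eager keeps the original torch.nn.Module.
print(type(eager_model))    # <class '__main__.TinyLM'>
print(type(thunder_model))  # <class 'thunder.core.module.ThunderModule'>

# The query from the comment above. In this toy (no FSDP) both sides report
# torch.Size([10, 4096]); the [10, 4096] vs [10] mismatch came from the
# benchmark/FSDP run.
print(eager_model.get_parameter("lm_head.weight")[:10].shape)
print(thunder_model.get_parameter("lm_head.weight")[:10].shape)
```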
The value of the loss is also different between Thunder and Eager:

**Eager:**
> iter 0: loss 11.9375, iter time: 6618.87ms, t: 8192
> iter 1: loss 9.8750, iter time: 1466.43ms, ...
I reproduced the issue manually on a cluster - here you can find full logs: [slurm-930652.txt](https://github.com/user-attachments/files/16247072/slurm-930652.txt)
Hi all! I wrote recently that the issue is fixed, but I had checked it only for one model (Gemma-7b). The error is still present (checked on INTERNAL_IMAGE:pjnl-20240830_ for Mistral-7B-v0.2,...
In the recent run the issue was present only in 3 cases, across 2 models ('CodeLlama-34b-hf', 'falcon-40b'), and I checked that it's not present at all for 2 cases (one...
Hi! So this issue was present recently in 7 cases, all of them using fp8. Below are the reproduction instructions (a quick GPU-count sanity check is sketched after the block):

```
Please use: 1 node(s), each with 8 GPUs.
Image "INTERNAL_IMAGE:pjnl-20241011"
Training...
```
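The sanity check mentioned above is just confirming that the node actually exposes 8 GPUs before launching; a sketch, not part of the official instructions:

```
# Quick check that the repro node matches "1 node(s), each with 8 GPUs".
import torch

print("CUDA available:", torch.cuda.is_available())
print("visible GPUs:  ", torch.cuda.device_count())  # expected: 8
```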
For the most recent set of issues I used this script to reproduce the error:

```
#!/bin/bash
#SBATCH -A YOUR_ACCOUNT
#SBATCH -p batch
#SBATCH -J YOUR_JOB_NAME
#SBATCH -N 2
#SBATCH ...
```
Actually we see the same results for other models. Is one issue enough to track all of them? Below are the results: 
Hi! Please let me know when we will be ready to check FSDP 2 again :)