Edd

22 comments by Edd

I think there's no problem in the code. Gemma2, Llama3.2, and Qwen have huge vocab sizes, so the `embedding` and `lm_head` layers are very large. When doing `CPT`, ...
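A rough back-of-the-envelope sketch of why the vocabulary size dominates here: both `embedding` and `lm_head` are `(vocab_size, hidden_size)` matrices, so their parameter count scales linearly with the vocabulary. The vocab and hidden sizes below are approximate assumptions for illustration, not exact model configs.

```python
# Rough illustration: parameter count and fp16/bf16 memory footprint of
# the embedding + lm_head pair for models with large vocabularies.
hidden_size = 4096  # assumed hidden dimension, varies by model

vocab_sizes = {
    "Gemma2": 256_000,     # approximate
    "Llama3.2": 128_256,   # approximate
    "Qwen2": 151_936,      # approximate
}

for name, vocab in vocab_sizes.items():
    # embedding and lm_head are each a (vocab, hidden) matrix
    params = 2 * vocab * hidden_size
    gigabytes = params * 2 / 1024**3  # 2 bytes per param in fp16/bf16
    print(f"{name}: {params / 1e9:.2f}B params in embedding+lm_head "
          f"(~{gigabytes:.1f} GB)")
```

With a 256k vocabulary, the two layers alone hold roughly 2B parameters, which is why full fine-tuning of them during continued pre-training is so memory-hungry.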

I am not sure exactly why we need to save both the `original_module` and `modules_to_save`. I guess it's because when you're doing LoRA, you can't just push gradients to the same...
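A minimal sketch of the idea behind keeping both copies (this is my own simplified reimplementation, not the real PEFT `ModulesToSaveWrapper`): the frozen `original_module` preserves the base weights so the adapter can later be disabled or unloaded, while a deep copy stored under `modules_to_save` is the one that actually receives gradients and gets trained.

```python
import copy
import torch
import torch.nn as nn

class ModulesToSaveWrapperSketch(nn.Module):
    """Simplified sketch of a PEFT-style wrapper for fully fine-tuned
    layers (e.g. lm_head): keep a frozen original plus a trainable copy."""

    def __init__(self, module: nn.Module, adapter_name: str = "default"):
        super().__init__()
        # Frozen base weights: needed to restore/unload the adapter later.
        self.original_module = module
        for p in self.original_module.parameters():
            p.requires_grad = False
        # Trainable deep copy: this is what gradients flow into.
        self.modules_to_save = nn.ModuleDict(
            {adapter_name: copy.deepcopy(module)}
        )
        for p in self.modules_to_save[adapter_name].parameters():
            p.requires_grad = True
        self.active_adapter = adapter_name

    def forward(self, x):
        # Only the trainable copy is used in the forward pass.
        return self.modules_to_save[self.active_adapter](x)
```

So the answer to "why both?" in this sketch: gradients go to the copy, never to `original_module`, which stays byte-identical to the base checkpoint.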