Liger-Kernel
Efficient Triton Kernels for LLM Training
### 🐛 Describe the bug
This issue discusses how we should modify our convergence test to handle numerical issues with logits.
### Context
In #704, we make `FusedLinearCrossEntropy` (flce)...
### 🐛 Describe the bug
The tolerance when comparing loss in the gemma3 multimodal model needs to be set higher (atol=1e-3, rtol=1e-3) compared to the others (atol=1e-8, rtol=1e-5) in order to pass...
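To make the tolerance gap above concrete, here is a minimal sketch of an absolute/relative tolerance check. It assumes the comparison follows the rule documented for `torch.allclose`, `|actual - expected| <= atol + rtol * |expected|`. The helper name `within_tolerance` and the two loss values are hypothetical, chosen only for illustration; they are not taken from the Liger-Kernel test suite.

```python
def within_tolerance(actual: float, expected: float, atol: float, rtol: float) -> bool:
    """Return True if `actual` is close to `expected` under the rule
    |actual - expected| <= atol + rtol * |expected|
    (the same rule documented for torch.allclose)."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Hypothetical loss values that differ by about 5e-4 (illustration only).
reference_loss = 2.3461
kernel_loss = 2.3456

# Strict tolerances (atol=1e-8, rtol=1e-5): allowed gap is about 2.3e-5.
strict_ok = within_tolerance(kernel_loss, reference_loss, atol=1e-8, rtol=1e-5)

# Relaxed tolerances (atol=1e-3, rtol=1e-3): allowed gap is about 3.3e-3.
relaxed_ok = within_tolerance(kernel_loss, reference_loss, atol=1e-3, rtol=1e-3)

print(f"strict: {strict_ok}, relaxed: {relaxed_ok}")
```

With these illustrative numbers, a gap of roughly 5e-4 fails the strict check but passes the relaxed one, which is the kind of difference the issue describes.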
Hello, so we discussed an approach to solve this: Running all the benchmarks takes less than an hour. I tried it on a single H100 GPU and it took me...