Liger-Kernel
Efficient Triton Kernels for LLM Training
### 🐛 Describe the bug

```
...
  File "/home/tromero/workspace/seahorse/.venv/lib/python3.11/site-packages/torch/_inductor/async_compile.py", line 173, in triton
    kernel = TritonCodeCache.load(kernel_name, source_code)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tromero/workspace/seahorse/.venv/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 3112, in load
    return _module_to_triton_kernel(PyCodeCache.load(source_code), kernel_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tromero/workspace/seahorse/.venv/lib/python3.11/site-packages/torch/_inductor/codecache.py",...
```
I wonder why you use 32 here (256 threads per block instance) instead of deciding based on the hidden size? Thanks. https://github.com/linkedin/Liger-Kernel/blob/dd86cbd2092177681acf75643ded1b23a785a816/src/liger_kernel/ops/fused_linear_cross_entropy.py#L95
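For context, one hedged sketch of what "deciding based on the hidden size" could look like. This is not Liger-Kernel's actual code: the function names, the per-thread workload factor, and the cap of 32 warps are all assumptions for illustration.

```python
def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n (Triton block sizes must be powers of two)."""
    return 1 << (n - 1).bit_length()

def pick_num_warps(hidden_size: int, warp_size: int = 32, elems_per_thread: int = 8) -> int:
    """Hypothetical heuristic: scale the warp count with the row width
    instead of hard-coding a constant, clamped to [1, 32] warps."""
    block_size = next_power_of_2(hidden_size)
    return max(1, min(32, block_size // (warp_size * elems_per_thread)))
```

For example, this would launch 16 warps for a 4096-wide hidden dimension but only 1 warp for a 128-wide one, rather than a fixed count for every model size.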
## Summary
This PR aims to support Pixtral.

## Testing Done
Tested the model and tested the monkey patch.

- Hardware Type: 3090
- [x] run `make test` to ensure correctness
-...
## Summary
Another small optimization :) The `logits_chunk.float()` allocation may be surprisingly large, e.g. Cohere models have 256K vocabs, so each logit chunk in float32 could be something like 1024...
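As a back-of-the-envelope check of why that allocation is so large (the 1024-row chunk size is an assumption; the PR text is truncated at exactly that number):

```python
def chunk_bytes(chunk_rows: int, vocab_size: int, bytes_per_elem: int = 4) -> int:
    """Memory footprint of one materialized logits chunk.
    float32 = 4 bytes per element."""
    return chunk_rows * vocab_size * bytes_per_elem

# A 1024-row chunk over a 256K (262,144) vocab in float32 is exactly 1 GiB.
one_chunk = chunk_bytes(1024, 262_144)  # 1024 * 262144 * 4 = 2**30 bytes
```

So a single upcast chunk can cost a gigabyte of activation memory on top of the bf16/fp16 logits it was cast from, which is why avoiding the extra float32 copy matters.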
## Summary

## Testing Done
- Hardware Type:
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ]...
As the community grows, keeping track of issues and PRs becomes more and more challenging. This pinned issue will serve as the central place to manage the progress in 2024...
### 🚀 The feature, motivation and pitch
TVD is a good distance metric ([ref](https://aclanthology.org/2023.acl-long.605.pdf)), its kernel is easy to implement, and it makes the gradient more stable compared to KL divergence and...
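For reference, total variation distance between two discrete distributions is half the L1 distance between them: TVD(P, Q) = ½ Σᵢ |pᵢ − qᵢ|. A minimal pure-Python sketch (a Triton kernel would parallelize this over the vocab dimension):

```python
def tvd(p, q):
    """Total variation distance between two discrete distributions
    given as equal-length sequences of probabilities."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))
```

Unlike KL divergence, TVD is symmetric and bounded in [0, 1] even when one distribution assigns zero probability where the other does not, which is one reason its gradients can be better behaved for distillation.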
## Summary
This PR adds a Flash Attention 2 Triton kernel and the monkey-patching of SDPA attention layers with our FA kernel.

## Details
The kernel supports fp16 and bfloat16,...
### 🐛 Describe the bug
Tried to reproduce the Liger kernel optimization with the Lightning trainer and DeepSpeed ZeRO-3 but encountered several errors.

### Reproduce script:
```
cd /examples/lightning/
python training.py...
```
### 🚀 The feature, motivation and pitch
We've implemented KL divergence and JSD loss. Thanks to the community! This feature request is to add an optional feature for ignoring...
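A minimal sketch of what an ignore-index option could look like for a per-token distillation loss. The function name, signature, and mean reduction here are assumptions for illustration, not the requested API:

```python
def masked_mean_loss(per_token_losses, labels, ignore_index=-100):
    """Average per-token losses, skipping positions whose label equals
    ignore_index (the convention PyTorch's cross-entropy uses for padding
    and prompt tokens)."""
    kept = [loss for loss, y in zip(per_token_losses, labels) if y != ignore_index]
    return sum(kept) / len(kept) if kept else 0.0
```

The key detail is that ignored positions must be excluded from the denominator as well, so padding-heavy batches do not dilute the loss.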