Liger-Kernel

Efficient Triton Kernels for LLM Training

Results 114 Liger-Kernel issues
Sort by recently updated

### 🐛 Describe the bug

```
...
  File "/home/tromero/workspace/seahorse/.venv/lib/python3.11/site-packages/torch/_inductor/async_compile.py", line 173, in triton
    kernel = TritonCodeCache.load(kernel_name, source_code)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tromero/workspace/seahorse/.venv/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 3112, in load
    return _module_to_triton_kernel(PyCodeCache.load(source_code), kernel_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tromero/workspace/seahorse/.venv/lib/python3.11/site-packages/torch/_inductor/codecache.py",...
```

I wonder why you use 32 (256 threads per block instance) here instead of deciding based on the hidden size? Thanks. https://github.com/linkedin/Liger-Kernel/blob/dd86cbd2092177681acf75643ded1b23a785a816/src/liger_kernel/ops/fused_linear_cross_entropy.py#L95
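For context, one common alternative to a hard-coded value is to derive the warp count from the problem size. The sketch below is a hypothetical heuristic, not Liger-Kernel's actual code; the function names and the one-warp-per-256-elements ratio are illustrative assumptions:

```python
# Hypothetical heuristic for scaling a Triton kernel's warp count with the
# hidden dimension, instead of hard-coding it. Not Liger-Kernel's actual code.

def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n (Triton block sizes must be powers of two)."""
    return 1 << (n - 1).bit_length()

def choose_num_warps(hidden_size: int, threads_per_warp: int = 32,
                     max_warps: int = 32) -> int:
    """Roughly one warp per 256 elements of the padded block,
    clamped to [1, max_warps]."""
    block_size = next_power_of_2(hidden_size)
    return max(1, min(max_warps, block_size // (8 * threads_per_warp)))
```

A heuristic like this keeps small hidden sizes from over-subscribing the SM while still saturating it for large ones, at the cost of a less predictable launch configuration.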

## Summary
This PR aims to support Pixtral.

## Testing Done
Tested the model and tested the monkey patch.
- Hardware Type: 3090
- [x] run `make test` to ensure correctness
- ...

## Summary
Another small optimization :) The `logits_chunk.float()` allocation may be surprisingly large, e.g. Cohere models have 256K vocabs, so each logit chunk in float32 could be something like 1024...
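To see why the upcast is costly, a quick back-of-the-envelope calculation helps. The chunk size of 1024 rows and vocab size of 256,000 below are illustrative assumptions, not values taken from the PR:

```python
# Rough memory cost of upcasting one logits chunk to float32.
# The chunk/vocab sizes are illustrative assumptions.

def chunk_bytes(chunk_rows: int, vocab_size: int, bytes_per_elem: int = 4) -> int:
    """Bytes needed to materialize a (chunk_rows, vocab_size) logits tensor."""
    return chunk_rows * vocab_size * bytes_per_elem

fp32 = chunk_bytes(1024, 256_000)     # float32 upcast
bf16 = chunk_bytes(1024, 256_000, 2)  # the original bf16 storage
print(f"{fp32 / 2**30:.2f} GiB fp32 vs {bf16 / 2**30:.2f} GiB bf16")
```

With these assumed sizes a single float32 chunk is close to 1 GiB, double the bf16 footprint, which is why avoiding the temporary allocation matters.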

## Summary

## Testing Done
- Hardware Type:
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ]...

reviewing

As the community grows, keeping track of issues and PRs becomes more and more challenging. This pinned issue will serve as the central place to track progress in 2024...

### 🚀 The feature, motivation and pitch TVD is a good distance metric ([ref](https://aclanthology.org/2023.acl-long.605.pdf)), its kernel is easy to implement, and it makes the gradient more stable compared to KL divergence and...
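For reference, total variation distance between two discrete distributions is half the L1 distance between their probability vectors. A minimal pure-Python sketch of the metric (not the proposed Triton kernel) shows why its gradients are tame:

```python
def tvd(p, q):
    """Total variation distance: 0.5 * sum(|p_i - q_i|).

    Bounded in [0, 1] for probability distributions, and its
    (sub)gradient w.r.t. p_i is 0.5 * sign(p_i - q_i), which stays
    bounded -- unlike KL divergence, whose gradient blows up as the
    reference probability approaches zero.
    """
    assert len(p) == len(q)
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

print(tvd([0.5, 0.5], [0.9, 0.1]))  # 0.4
```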

feature

## Summary
This PR adds a Flash Attention 2 Triton kernel and the monkey-patching of SDPA attention layers with our FA kernel.

## Details
The kernel supports fp16 and bfloat16,...

### 🐛 Describe the bug
Tried to reproduce the Liger kernel optimization with the Lightning trainer and DeepSpeed ZeRO-3, but encountered several errors.

### Reproduce script:
```
cd /examples/lightning/
python training.py...
```

### 🚀 The feature, motivation and pitch We've implemented KL divergence and JSD loss, thanks to the community! This feature request is to add an optional feature for ignoring...
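An ignore-index option typically masks out positions before the loss is reduced. The sketch below is a pure-Python illustration of the idea, not an actual Liger-Kernel API; the function name and the default of -100 (borrowed from the convention in PyTorch losses) are assumptions:

```python
import math

def kl_div_with_ignore(log_p, log_q, targets, ignore_index=-100):
    """Mean KL(q || p) over positions, skipping any position whose
    target equals ignore_index (mirroring the -100 convention of
    PyTorch losses). log_p / log_q are per-position lists of
    log-probabilities over the vocabulary.
    """
    total, count = 0.0, 0
    for lp, lq, t in zip(log_p, log_q, targets):
        if t == ignore_index:
            continue  # masked position contributes neither loss nor gradient
        total += sum(math.exp(q) * (q - p) for p, q in zip(lp, lq))
        count += 1
    return total / max(count, 1)
```

The key design point is that masked positions are excluded from both the sum and the denominator, so padding tokens cannot dilute the average loss.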

feature