Liger-Kernel
Efficient Triton Kernels for LLM Training
### 🐛 Describe the bug

```
...
  File "/home/tromero/workspace/seahorse/.venv/lib/python3.11/site-packages/torch/_inductor/async_compile.py", line 173, in triton
    kernel = TritonCodeCache.load(kernel_name, source_code)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tromero/workspace/seahorse/.venv/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 3112, in load
    return _module_to_triton_kernel(PyCodeCache.load(source_code), kernel_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tromero/workspace/seahorse/.venv/lib/python3.11/site-packages/torch/_inductor/codecache.py",...
```
I wonder why you use 32 here (256 threads per block instance) instead of deciding based on the hidden size? Thanks. https://github.com/linkedin/Liger-Kernel/blob/dd86cbd2092177681acf75643ded1b23a785a816/src/liger_kernel/ops/fused_linear_cross_entropy.py#L95
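For context, one hedged sketch of what "deciding based on the hidden size" could look like. This is not Liger-Kernel's actual code: the function names, the per-thread workload factor, and the cap of 32 warps are all assumptions for illustration.

```python
def next_power_of_2(n: int) -> int:
    """Smallest power of two >= n (Triton block sizes must be powers of two)."""
    return 1 << (n - 1).bit_length()

def pick_num_warps(hidden_size: int, warp_size: int = 32, elems_per_thread: int = 8) -> int:
    """Hypothetical heuristic: scale the warp count with the row width
    instead of hard-coding a constant, clamped to [1, 32] warps."""
    block_size = next_power_of_2(hidden_size)
    return max(1, min(32, block_size // (warp_size * elems_per_thread)))
```

For example, this would launch 16 warps for a 4096-wide hidden dimension but only 1 warp for a 128-wide one, rather than a fixed count for every model size.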
## Summary
This PR aims to support Pixtral.

## Testing Done
Tested the model and tested the monkey patch.

- Hardware Type: 3090
- [x] run `make test` to ensure correctness
-...
## Summary
Another small optimization :) The `logits_chunk.float()` allocation may be surprisingly large, e.g. Cohere models have 256K vocabs, so each logit chunk in float32 could be something like 1024...
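As a back-of-the-envelope check of why that allocation is so large (the 1024-row chunk size is an assumption; the PR text is truncated at exactly that number):

```python
def chunk_bytes(chunk_rows: int, vocab_size: int, bytes_per_elem: int = 4) -> int:
    """Memory footprint of one materialized logits chunk.
    float32 = 4 bytes per element."""
    return chunk_rows * vocab_size * bytes_per_elem

# A 1024-row chunk over a 256K (262,144) vocab in float32 is exactly 1 GiB.
one_chunk = chunk_bytes(1024, 262_144)  # 1024 * 262144 * 4 = 2**30 bytes
```

So a single upcast chunk can cost a gigabyte of activation memory on top of the bf16/fp16 logits it was cast from, which is why avoiding the extra float32 copy matters.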
## Summary

## Testing Done
- Hardware Type:
- [ ] run `make test` to ensure correctness
- [ ] run `make checkstyle` to ensure code style
- [ ]...
As the community grows, keeping track of issues and PRs becomes more and more challenging. This pinned issue will serve as the central place to manage the progress in 2024...
### 🚀 The feature, motivation and pitch
TVD is a good distance metric ([ref](https://aclanthology.org/2023.acl-long.605.pdf)), its kernel is easy to implement, and it makes the gradient more stable compared to KL divergence and...
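For reference, total variation distance between two discrete distributions is half the L1 distance between them: TVD(P, Q) = ½ Σᵢ |pᵢ − qᵢ|. A minimal pure-Python sketch (a Triton kernel would parallelize this over the vocab dimension):

```python
def tvd(p, q):
    """Total variation distance between two discrete distributions
    given as equal-length sequences of probabilities."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))
```

Unlike KL divergence, TVD is symmetric and bounded in [0, 1] even when one distribution assigns zero probability where the other does not, which is one reason its gradients can be better behaved for distillation.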
## Summary
This PR adds a Flash Attention 2 Triton kernel and the monkey-patching of SDPA attention layers with our FA kernel.

## Details
The kernel supports fp16 and bfloat16,...
### 🐛 Describe the bug
Tried to reproduce the Liger kernel optimization with the Lightning trainer and DeepSpeed ZeRO-3 but encountered several errors.

### Reproduce script:
```
cd /examples/lightning/
python training.py...
```
### 🚀 The feature, motivation and pitch
We've implemented KL divergence and JSD loss. Thanks to the community! This feature request is to add an optional feature for ignoring...
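A minimal sketch of what an ignore-index option could look like for a per-token distillation loss. The function name, signature, and mean reduction here are assumptions for illustration, not the requested API:

```python
def masked_mean_loss(per_token_losses, labels, ignore_index=-100):
    """Average per-token losses, skipping positions whose label equals
    ignore_index (the convention PyTorch's cross-entropy uses for padding
    and prompt tokens)."""
    kept = [loss for loss, y in zip(per_token_losses, labels) if y != ignore_index]
    return sum(kept) / len(kept) if kept else 0.0
```

The key detail is that ignored positions must be excluded from the denominator as well, so padding-heavy batches do not dilute the loss.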