# flash-linear-attention
Efficient implementations of state-of-the-art linear attention models in PyTorch and Triton
### Proposal

The `chunk` and `fused_chunk` modes have complementary strengths in different scenarios. The interface should be unified so that the user is agnostic to the underlying implementation. The API...
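As a rough illustration of what a unified entry point could look like, here is a naive PyTorch sketch: a single `linear_attn` function (a hypothetical name) dispatching to a chunkwise or a non-chunked reference path behind a `mode="auto"` flag. Both backends and the dispatch threshold are placeholders, not FLA's actual kernels or policy.

```python
import torch

def _naive_linear_attn(q, k, v):
    # Plain causal (unnormalized) linear attention: o_t = sum_{s<=t} (q_t . k_s) v_s.
    attn = (q @ k.transpose(-1, -2)).tril()              # [B, H, T, T], causal mask
    return attn @ v

def _chunk_linear_attn(q, k, v, chunk_size=64):
    # Chunkwise form: intra-chunk attention plus a running inter-chunk state.
    B, H, T, K = q.shape
    V = v.shape[-1]
    o = torch.empty_like(v)
    S = q.new_zeros(B, H, K, V)                          # running state sum_s k_s^T v_s
    for i in range(0, T, chunk_size):
        qi = q[..., i:i + chunk_size, :]
        ki = k[..., i:i + chunk_size, :]
        vi = v[..., i:i + chunk_size, :]
        intra = (qi @ ki.transpose(-1, -2)).tril() @ vi  # within-chunk contribution
        o[..., i:i + chunk_size, :] = intra + qi @ S     # plus all previous chunks
        S = S + ki.transpose(-1, -2) @ vi                # update the running state
    return o

def linear_attn(q, k, v, mode: str = "auto", chunk_size: int = 64):
    """Hypothetical unified entry point hiding the chunk/fused_chunk split."""
    if mode == "auto":
        # Placeholder heuristic; a real dispatch rule would be tuned per GPU/seq len.
        mode = "fused_chunk" if q.shape[-2] <= 512 else "chunk"
    if mode == "chunk":
        return _chunk_linear_attn(q, k, v, chunk_size)
    if mode == "fused_chunk":
        return _naive_linear_attn(q, k, v)               # stands in for the fused kernel
    raise ValueError(f"unknown mode: {mode}")
```

The design point is that callers only ever see one function; which kernel actually runs becomes an internal detail that can still be forced explicitly for benchmarking.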
### Feature Request

1. Include a flag (e.g., `use_default_init`) in the model configuration or constructor. When set to `False`, this flag would disable FLA's default initialization logic entirely, allowing users...
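A minimal sketch of how such a flag could be threaded through a module, reusing the `use_default_init` name from the request; the module layout and the Xavier initializer below are purely illustrative, not FLA's actual initialization scheme.

```python
import torch.nn as nn

class LinearAttentionBlock(nn.Module):
    """Toy block illustrating the requested escape hatch."""

    def __init__(self, hidden_size: int = 1024, use_default_init: bool = True):
        super().__init__()
        self.q_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.k_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        if use_default_init:
            # Library-provided initialization (illustrative choice here).
            self.apply(self._init_weights)
        # With use_default_init=False, PyTorch's defaults are kept and users can
        # apply their own scheme (e.g. muP-style scaling) after construction.

    @staticmethod
    def _init_weights(module):
        if isinstance(module, nn.Linear):
            nn.init.xavier_uniform_(module.weight)
```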
### Proposal

Support context parallelism for all linear attention models.

### Rationale

One of the major advantages of linear attention is that it enables long-sequence modeling. However, for training...
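To make the rationale concrete, here is a naive sketch (not FLA code) of why context parallelism is cheap for linear attention: with the sequence split across ranks, only the `[K, V]`-sized recurrent state has to cross rank boundaries, not the tokens themselves. The function assumes unnormalized linear attention, an already-initialized `torch.distributed` process group, and serializes the ranks for clarity.

```python
import torch
import torch.distributed as dist

def cp_linear_attn_forward(q, k, v, group=None):
    """Sequence (context) parallel forward for unnormalized linear attention.

    Each rank holds a contiguous slice of the sequence of shape [B, H, T_local, *].
    """
    rank, world = dist.get_rank(group), dist.get_world_size(group)
    B, H, T_local, K = q.shape
    V = v.shape[-1]

    # Local chunkwise pass: causal attention within this rank's slice.
    intra = (q @ k.transpose(-1, -2)).tril() @ v
    local_state = k.transpose(-1, -2) @ v                 # [B, H, K, V]

    # Receive the accumulated state of all preceding ranks (zeros on rank 0).
    prev_state = torch.zeros(B, H, K, V, dtype=q.dtype, device=q.device)
    if rank > 0:
        dist.recv(prev_state, src=rank - 1, group=group)
    # Forward the updated prefix state to the next rank.
    if rank < world - 1:
        dist.send(prev_state + local_state, dst=rank + 1, group=group)

    return intra + q @ prev_state
```

A real implementation would overlap the state exchange with computation (e.g. ring-style) instead of serializing the ranks as done here.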
### Feature Request

Hello, thank you for all of your great work. I was wondering if it would be a reasonable inclusion to add even more fused linear activation functions...
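Such fusions are usually about avoiding a materialized intermediate rather than the arithmetic itself. The sketch below is not a Triton kernel and not part of FLA; it is a small `torch.autograd.Function` that only illustrates the memory-saving pattern a fused linear + SiLU would target: the pre-activation is recomputed in backward instead of being saved.

```python
import torch
import torch.nn.functional as F

class LinearSiLU(torch.autograd.Function):
    """Illustrative linear + SiLU with activation recomputation (hypothetical)."""

    @staticmethod
    def forward(ctx, x, weight):
        y = F.linear(x, weight)          # [..., out]
        ctx.save_for_backward(x, weight) # y itself is deliberately not stored
        return F.silu(y)

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        y = F.linear(x, weight)                          # recompute instead of storing
        s = torch.sigmoid(y)
        dsilu = s * (1 + y * (1 - s))                    # d/dy [y * sigmoid(y)]
        grad_y = grad_out * dsilu
        grad_x = grad_y @ weight                         # since y = x @ W^T
        grad_w = grad_y.flatten(0, -2).t() @ x.flatten(0, -2)
        return grad_x, grad_w
```

Usage would be `LinearSiLU.apply(x, weight)`; an actual fused kernel would additionally merge the matmul and the activation into one launch.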
### Proposal

The current `chunk` mode normally loads 64x64 blocks, does the computation, and then saves the resulting hidden state, which can create an I/O burden. In Tri Dao's Mamba2 implementation...
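For reference, the naive two-pass structure below (plain PyTorch, unnormalized linear attention, sequence length assumed divisible by the chunk size) makes the I/O in question explicit: pass 1 materializes one `[K, V]` state per chunk, and pass 2 reads those states back to form outputs. It illustrates where the traffic comes from, not FLA's or Mamba2's actual kernels.

```python
import torch

def chunk_states_then_outputs(q, k, v, chunk_size=64):
    B, H, T, K = q.shape
    V = v.shape[-1]
    assert T % chunk_size == 0, "illustration assumes full chunks"
    N = T // chunk_size
    q, k, v = (x.reshape(B, H, N, chunk_size, -1) for x in (q, k, v))

    # Pass 1: per-chunk states and their exclusive prefix sum across chunks.
    # In a kernel, `chunk_states` is the [B, H, N, K, V] tensor written to HBM.
    chunk_states = k.transpose(-1, -2) @ v
    prefix = chunk_states.cumsum(2) - chunk_states        # state before each chunk

    # Pass 2: intra-chunk attention plus the contribution of all previous chunks.
    intra = (q @ k.transpose(-1, -2)).tril() @ v
    o = intra + q @ prefix
    return o.reshape(B, H, T, -1)
```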
### Proposal

Fuse shortconv and the output norm/gate into kernels, as in Mamba1 and Mamba2.

### Rationale

QKV ShortConv introduces three additional activations, resulting in non-negligible memory overhead.
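For context, the unfused pattern looks roughly like the depthwise causal convolution below (an illustrative PyTorch reference, not FLA's kernel). Applied separately to q, k, and v, each call keeps its own input activation alive for backward, which is the memory the proposed fusion would reclaim.

```python
import torch
import torch.nn.functional as F

def short_conv(x, weight):
    """Depthwise causal short convolution followed by a SiLU activation.

    x: [B, T, D], weight: [D, W] with a small window W (e.g. 4).
    """
    B, T, D = x.shape
    W = weight.shape[-1]
    x = x.transpose(1, 2)                              # [B, D, T]
    x = F.pad(x, (W - 1, 0))                           # left-pad so the conv is causal
    y = F.conv1d(x, weight.unsqueeze(1), groups=D)     # depthwise: one filter per channel
    return F.silu(y).transpose(1, 2)                   # [B, T, D]
```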
### Proposal

* We want to add `apply_tp` & `apply_cp` functions for each model, as their layer definitions can vary. Also see the comments in https://github.com/fla-org/flame/issues/4
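A sketch of what such per-model hooks might look like, using PyTorch's tensor-parallel API. The module paths in the plan (`attn.q_proj`, etc.), the `model.layers` layout, and the function signatures are assumptions for illustration, not an agreed-upon FLA interface.

```python
from torch import nn
from torch.distributed.device_mesh import DeviceMesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

def apply_tp(model: nn.Module, tp_mesh: DeviceMesh) -> nn.Module:
    """Hypothetical per-model hook: shard one model family's projections."""
    for layer in model.layers:
        parallelize_module(
            layer,
            tp_mesh,
            {
                # Column-parallel inputs, row-parallel output projection.
                "attn.q_proj": ColwiseParallel(),
                "attn.k_proj": ColwiseParallel(),
                "attn.v_proj": ColwiseParallel(),
                "attn.o_proj": RowwiseParallel(),
            },
        )
    return model

def apply_cp(model: nn.Module, cp_mesh: DeviceMesh) -> nn.Module:
    """Hypothetical per-model hook for context parallelism; left as a stub
    because the sequence-sharding strategy differs per layer type."""
    raise NotImplementedError
```

Keeping these as per-model functions (rather than one generic pass) lets each architecture ship a plan that matches its own layer definitions.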