zzlol63

Results: 14 comments by zzlol63

I rewrote the PR to support older models that use `scaled_dot_product_attention` directly, such as SDXL and SD1.5. Below are the benchmark results: SD1.5, SDXL and FLUX.1 enjoy a 12-20% speedup...
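
The core of the change is conceptually just a dispatch layer. Here is a minimal sketch (simplified, not the exact PR code) of routing `scaled_dot_product_attention` calls to FlashAttention, falling back to the stock kernel whenever FlashAttention can't handle the call:

```python
import torch
import torch.nn.functional as F
from flash_attn import flash_attn_func  # assumes flash-attn is installed

_original_sdpa = F.scaled_dot_product_attention

def patched_sdpa(query, key, value, attn_mask=None, dropout_p=0.0,
                 is_causal=False, scale=None):
    # flash_attn_func has no attn_mask argument and only supports fp16/bf16,
    # so fall back to the stock kernel when those conditions aren't met.
    if attn_mask is not None or query.dtype not in (torch.float16, torch.bfloat16):
        return _original_sdpa(query, key, value, attn_mask=attn_mask,
                              dropout_p=dropout_p, is_causal=is_causal, scale=scale)
    # SDPA uses (batch, heads, seq, dim); flash-attn expects (batch, seq, heads, dim).
    q, k, v = (t.transpose(1, 2) for t in (query, key, value))
    out = flash_attn_func(q, k, v, dropout_p=dropout_p, softmax_scale=scale,
                          causal=is_causal)
    return out.transpose(1, 2)

F.scaled_dot_product_attention = patched_sdpa
```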

I've also updated requirements-cuda.txt to automatically include the precompiled wheels for flash-attn (pinned to PyTorch 2.7.x), for ease of use for Windows users.
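
For illustration, a pinned direct-URL requirement with a platform marker looks roughly like the line below; the URL is a hypothetical placeholder, not the actual wheel the PR pins:

```text
# hypothetical placeholder URL - the real entry pins a wheel built against torch 2.7.x
flash-attn @ https://example.com/flash_attn-2.7.4+cu128torch2.7-cp312-cp312-win_amd64.whl ; sys_platform == "win32"
```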

> > I've also updated requirements-cuda.txt to automatically include the precompiled wheels for flash-attn (pinned to PyTorch 2.7.x), for ease of use for Windows users.
>
> OneTrainer is using...

I've now migrated the patch into the UI as a toggle. It's now possible to flip it on and off while training and immediately see the difference in speed...
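
Conceptually the toggle is just a flag that is checked on every attention call, so flipping it takes effect from the very next training step. A rough sketch with assumed names (not OneTrainer's actual code):

```python
import torch
import torch.nn.functional as F
from flash_attn import flash_attn_func

use_flash_attention = False  # bound to the UI toggle (hypothetical name)

def attention(query, key, value, is_causal=False):
    # Backend is chosen per call, so the flag can be flipped mid-training.
    if use_flash_attention and query.dtype in (torch.float16, torch.bfloat16):
        q, k, v = (t.transpose(1, 2) for t in (query, key, value))
        return flash_attn_func(q, k, v, causal=is_causal).transpose(1, 2)
    return F.scaled_dot_product_attention(query, key, value, is_causal=is_causal)
```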

@O-J1 Fair enough, I've moved it into the GenericTrainer.

Preliminary investigation for Chroma suggests the shape of the attention_mask does not match what is expected, which leads to the validation error: https://github.com/huggingface/diffusers/blob/8f80dda193f79af3ccd0f985906d61123d69df08/src/diffusers/models/transformers/transformer_chroma.py#L256 The issue can also be replicated...

After doing some research, I found that standard FlashAttention does not support attention masking out of the box, which some models require to implement functionality such as caption dropout. This will...
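
To illustrate the limitation: `flash_attn_func` has no `attn_mask` parameter at all. The closest workaround in flash-attn itself is the varlen API, which can only express padding-style masks by packing the valid tokens; arbitrary masks remain out of reach. A sketch of that conversion (the glue code is my own assumption, not OneTrainer's implementation):

```python
import torch
import torch.nn.functional as F
from flash_attn import flash_attn_varlen_func

def padding_masked_attention(q, k, v, key_padding_mask):
    """q, k, v: (batch, seq, heads, dim) in fp16/bf16 on CUDA;
    key_padding_mask: (batch, seq) bool, True = valid token."""
    seqlens = key_padding_mask.sum(dim=1, dtype=torch.int32)
    cu_seqlens = F.pad(seqlens.cumsum(0, dtype=torch.int32), (1, 0))
    max_seqlen = int(seqlens.max())
    # Pack only the valid tokens of every sequence into one flat tensor.
    flat = key_padding_mask.flatten()
    q_p, k_p, v_p = (t.flatten(0, 1)[flat] for t in (q, k, v))
    # Returns packed (total, heads, dim); scatter back with the same mask
    # to recover the (batch, seq, heads, dim) layout.
    return flash_attn_varlen_func(q_p, k_p, v_p, cu_seqlens, cu_seqlens,
                                  max_seqlen, max_seqlen)
```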

Try with build isolation disabled and make sure to restrict the MAX_JOBS used by ninja (ensure the ninja package is installed beforehand for a faster build), or you will OOM your system: `MAX_JOBS=4 pip...

@dxqb There are a few repos which offer precompiled wheels, but only certain PyTorch+CUDA+Python configurations are supported. Below is an example for Linux: https://github.com/mjun0812/flash-attention-prebuild-wheels I think we should focus on...
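
To figure out which prebuilt wheel matches your environment, print the three version tags a wheel filename encodes:

```python
import sys
import torch

print("torch :", torch.__version__)    # e.g. 2.7.1+cu128
print("cuda  :", torch.version.cuda)   # e.g. 12.8 (CUDA torch was built with)
print("python:", sys.version_info[:2]) # e.g. (3, 12) -> cp312 wheel tag
```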

Also, I fully trained a Chroma LoRA last night using masked training + prior preservation with FlexAttention (as per https://github.com/Nerogar/OneTrainer/issues/1090#issuecomment-3477943415), and the results turned out fine - the model converged as expected...
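
For reference, a minimal FlexAttention sketch (PyTorch 2.5+) of the kind of mask that plain `flash_attn_func` cannot express; the padding mask here is illustrative, not OneTrainer's exact code:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 2, 8, 256, 64
q = k = v = torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
valid = torch.ones(B, S, dtype=torch.bool, device="cuda")
valid[1, 128:] = False  # pretend the second sample's tail is padded/dropped

def mask_mod(b, h, q_idx, kv_idx):
    return valid[b, kv_idx]  # attend only to valid key positions

block_mask = create_block_mask(mask_mod, B, None, S, S)  # None broadcasts over heads
out = flex_attention(q, k, v, block_mask=block_mask)
```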