Casper
The API is close to the same as in v2 and is almost a drop-in replacement, but not completely, as you outlined. On your end, you need to install...
FA3 is specifically developed for Hopper because the architecture has new instructions that previous architectures lack.
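To illustrate the Hopper-only constraint, you can gate the FA3 path on the compute capability (my own sketch, not code from the repo):

```
# Only take the FA3 path on Hopper (compute capability 9.x); fall back to FA2 otherwise.
import torch

major, minor = torch.cuda.get_device_capability()
use_fa3 = major == 9  # Hopper is sm90; older GPUs lack the instructions FA3 relies on
print(f"sm{major}{minor}, using FA3: {use_fa3}")
```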
A small script to compile the kernels is shown below. I think this has a lot of potential :)
```
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/hopper
pip install ninja packaging
FLASH_ATTENTION_DISABLE_SM80=TRUE FLASH_ATTENTION_DISABLE_FP8=TRUE...
```
@NanoCode012 There are no API changes except for the missing features. So you really want both v2 and v3 installed for this to work. I would suggest this code to test things...
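Something along these lines should do it (a rough sketch of what I mean, assuming both the v2 `flash_attn` package and the Hopper `flash_attn_interface` build are importable; shapes and dtype are just placeholders):

```
# Rough comparison sketch: run the same inputs through FA2 and the Hopper build
# and look at the numerical difference. Assumes a Hopper GPU with both installed.
import torch
from flash_attn import flash_attn_func as fa2_func
from flash_attn_interface import flash_attn_func as fa3_func

# (batch, seqlen, nheads, headdim) in bf16 on the GPU
q, k, v = (torch.randn(2, 1024, 8, 128, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

out2 = fa2_func(q, k, v, causal=True)
out3 = fa3_func(q, k, v, causal=True)
# Some FA3 builds also return the softmax LSE, so unwrap if needed.
out3 = out3[0] if isinstance(out3, tuple) else out3

print("max abs diff:", (out2 - out3).abs().max().item())
```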
I never actually got to finish building it; I got impatient because it takes a long time and I wanted to do other stuff.
- dropout_p: yes, this is dropped. not...
@winglian Unless there is something wrong with the new flash attention version, or the advertised speed improvement is wrong, I think it must be something with the setup. Quick...
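A rough timing sketch like this (my own example, not the original snippet; it assumes both builds are installed and a Hopper GPU) can show whether FA3 is actually faster on the machine:

```
# Quick-and-dirty timing comparison between FA2 and the Hopper build.
import time
import torch
from flash_attn import flash_attn_func as fa2_func
from flash_attn_interface import flash_attn_func as fa3_func

q, k, v = (torch.randn(4, 4096, 16, 128, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

def bench(fn, iters=50):
    # Warm up, then average the time of `iters` forward calls.
    for _ in range(5):
        fn(q, k, v, causal=True)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(q, k, v, causal=True)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

print(f"FA2: {bench(fa2_func) * 1e3:.2f} ms/iter, FA3: {bench(fa3_func) * 1e3:.2f} ms/iter")
```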
@LDLINGLINGLING Sorry for taking so long. I simplified the modeling and added your custom quantizer to the docs. We now use Triton kernels, which work with smaller models like MiniCPM3...
We do not have a solution for storing weights in 3 or 6 bits, nor do we know how to run inference on them just yet. I'm open to PRs on this.
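If anyone wants to take a stab at the storage side, the basic idea would be packing sub-byte values into a wider integer buffer. A purely illustrative sketch (my own code, not anything that exists in the repo), packing eight 3-bit values into three bytes:

```
# Illustrative only: pack/unpack 3-bit integer values (0..7) into a uint8 buffer,
# eight values per 24 bits. Real 3/6-bit storage would also need scales, zero
# points, and kernels that read the packed layout.
import torch

def pack_3bit(x: torch.Tensor) -> torch.Tensor:
    # x: flat int tensor with values in [0, 7], length a multiple of 8.
    x = x.to(torch.int32).reshape(-1, 8)
    word = torch.zeros(x.shape[0], dtype=torch.int32)
    for i in range(8):
        word |= x[:, i] << (3 * i)  # accumulate 8 * 3 = 24 bits per group
    return torch.stack([(word >> (8 * j)) & 0xFF for j in range(3)], dim=1).to(torch.uint8)

def unpack_3bit(packed: torch.Tensor) -> torch.Tensor:
    b = packed.to(torch.int32)
    word = b[:, 0] | (b[:, 1] << 8) | (b[:, 2] << 16)
    return torch.stack([(word >> (3 * i)) & 0x7 for i in range(8)], dim=1).reshape(-1)

vals = torch.randint(0, 8, (64,), dtype=torch.int32)
assert torch.equal(unpack_3bit(pack_3bit(vals)), vals)
```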
I would really appreciate it if you could look into it! TorchTitan uses `torch.distributed.pipelining`, most of which is only available from 2.5.0 or in nightly builds. There are many key features...
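For context, a minimal version gate looks something like this (my own sketch, assuming the 2.5.0 cutoff mentioned above; the imported names are just examples of what the module exposes):

```
# Guard the torch.distributed.pipelining import behind a version check.
from packaging.version import parse as parse_version
import torch

if parse_version(torch.__version__) >= parse_version("2.5.0"):
    from torch.distributed.pipelining import PipelineStage, ScheduleGPipe
else:
    raise RuntimeError(
        "This pipelining code path needs torch>=2.5.0 or a recent nightly; "
        f"found {torch.__version__}"
    )
```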
@tianyu-l Given the performance of this specific model and the recent boom in activity, can we reasonably expect TorchTitan to support this model? I understand this model is not created...