Casper
The API is close to the same as in v2 and is almost a drop-in replacement, but not completely, as you outlined. On your end, you need to install...
FA3 is specifically developed for Hopper because the architecture has new instructions that previous architectures lack.
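To illustrate the Hopper-only constraint, you can gate the FA3 path on the compute capability (my own sketch, not code from the repo):

```
# Only take the FA3 path on Hopper (compute capability 9.x); fall back to FA2 otherwise.
import torch

major, minor = torch.cuda.get_device_capability()
use_fa3 = major == 9  # Hopper is sm90; older GPUs lack the instructions FA3 relies on
print(f"sm{major}{minor}, using FA3: {use_fa3}")
```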
A small script to compile the kernels is shown below. I think this has a lot of potential :)
```
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention/hopper
pip install ninja packaging
FLASH_ATTENTION_DISABLE_SM80=TRUE FLASH_ATTENTION_DISABLE_FP8=TRUE...
```
@NanoCode012 There are no API changes except for the missing features. So you really want both v2 and v3 installed for this to work. I would suggest this code to test things...
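Something along these lines should do it (a rough sketch of what I mean, assuming both the v2 `flash_attn` package and the Hopper `flash_attn_interface` build are importable; shapes and dtype are just placeholders):

```
# Rough comparison sketch: run the same inputs through FA2 and the Hopper build
# and look at the numerical difference. Assumes a Hopper GPU with both installed.
import torch
from flash_attn import flash_attn_func as fa2_func
from flash_attn_interface import flash_attn_func as fa3_func

# (batch, seqlen, nheads, headdim) in bf16 on the GPU
q, k, v = (torch.randn(2, 1024, 8, 128, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

out2 = fa2_func(q, k, v, causal=True)
out3 = fa3_func(q, k, v, causal=True)
# Some FA3 builds also return the softmax LSE, so unwrap if needed.
out3 = out3[0] if isinstance(out3, tuple) else out3

print("max abs diff:", (out2 - out3).abs().max().item())
```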
I never actually got to finish building it; I got impatient because it takes a long time and I wanted to do other stuff.
- dropout_p: yes, this is dropped. not...
@winglian Unless there is something wrong with the new flash attention version, or the advertised speed improvement is wrong, I think it must be something with the setup. Quick...
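A rough timing sketch like this (my own example, not the original snippet; it assumes both builds are installed and a Hopper GPU) can show whether FA3 is actually faster on the machine:

```
# Quick-and-dirty timing comparison between FA2 and the Hopper build.
import time
import torch
from flash_attn import flash_attn_func as fa2_func
from flash_attn_interface import flash_attn_func as fa3_func

q, k, v = (torch.randn(4, 4096, 16, 128, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

def bench(fn, iters=50):
    # Warm up, then average the time of `iters` forward calls.
    for _ in range(5):
        fn(q, k, v, causal=True)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(q, k, v, causal=True)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

print(f"FA2: {bench(fa2_func) * 1e3:.2f} ms/iter, FA3: {bench(fa3_func) * 1e3:.2f} ms/iter")
```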
@LDLINGLINGLING Sorry for taking so long. I simplified the modeling and added your custom quantizer to the docs. We now use Triton kernels, which work with smaller models like MiniCPM3...
We do not have a solution for storing weights in 3 or 6 bits, nor do we know how to run inference on them just yet. I'm open to PRs on this.
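If anyone wants to take a stab at the storage side, the basic idea would be packing sub-byte values into a wider integer buffer. A purely illustrative sketch (my own code, not anything that exists in the repo), packing eight 3-bit values into three bytes:

```
# Illustrative only: pack/unpack 3-bit integer values (0..7) into a uint8 buffer,
# eight values per 24 bits. Real 3/6-bit storage would also need scales, zero
# points, and kernels that read the packed layout.
import torch

def pack_3bit(x: torch.Tensor) -> torch.Tensor:
    # x: flat int tensor with values in [0, 7], length a multiple of 8.
    x = x.to(torch.int32).reshape(-1, 8)
    word = torch.zeros(x.shape[0], dtype=torch.int32)
    for i in range(8):
        word |= x[:, i] << (3 * i)  # accumulate 8 * 3 = 24 bits per group
    return torch.stack([(word >> (8 * j)) & 0xFF for j in range(3)], dim=1).to(torch.uint8)

def unpack_3bit(packed: torch.Tensor) -> torch.Tensor:
    b = packed.to(torch.int32)
    word = b[:, 0] | (b[:, 1] << 8) | (b[:, 2] << 16)
    return torch.stack([(word >> (3 * i)) & 0x7 for i in range(8)], dim=1).reshape(-1)

vals = torch.randint(0, 8, (64,), dtype=torch.int32)
assert torch.equal(unpack_3bit(pack_3bit(vals)), vals)
```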
I would really appreciate it if you could look into it! TorchTitan uses `torch.distributed.pipelining`, most of which is only available from 2.5.0 or in nightly builds. There are many key features...
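For context, a minimal version gate looks something like this (my own sketch, assuming the 2.5.0 cutoff mentioned above; the imported names are just examples of what the module exposes):

```
# Guard the torch.distributed.pipelining import behind a version check.
from packaging.version import parse as parse_version
import torch

if parse_version(torch.__version__) >= parse_version("2.5.0"):
    from torch.distributed.pipelining import PipelineStage, ScheduleGPipe
else:
    raise RuntimeError(
        "This pipelining code path needs torch>=2.5.0 or a recent nightly; "
        f"found {torch.__version__}"
    )
```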
@tianyu-l Given the performance of this specific model and the recent boom in activity, can we reasonably expect TorchTitan to support this model? I understand this model is not created...