Arthur


I am getting a similar issue, without training, with torch nightly on Llama, so I can confirm something's wrong! It might be on our side, but as far as I tested, all...

```python
>>> from flash_attn import flash_attn_func
>>> import torch
>>> print(torch.__version__)
2.3.0.dev20240208+cu121
>>> flash_attn_func(torch.ones((2,3), dtype=torch.bfloat16), torch.ones((2,3), dtype=torch.bfloat16), torch.ones((2,3), dtype=torch.bfloat16), 1, softmax_scale=1, causal=True)
```
....
```
File ~/miniconda3/envs/py310/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py:51, in _flash_attn_forward(q, k,...
```
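
For reference, a minimal sketch of a call that matches the layout `flash_attn_func` expects, i.e. (batch, seqlen, nheads, headdim) CUDA tensors in fp16/bf16; the shapes here are purely illustrative:

```python
# Sketch only: illustrative shapes, assumes a CUDA device and flash-attn installed.
import torch
from flash_attn import flash_attn_func

# (batch, seqlen, nheads, headdim) tensors in bf16 on the GPU
q = torch.ones((2, 3, 4, 64), dtype=torch.bfloat16, device="cuda")
k = torch.ones((2, 3, 4, 64), dtype=torch.bfloat16, device="cuda")
v = torch.ones((2, 3, 4, 64), dtype=torch.bfloat16, device="cuda")

out = flash_attn_func(q, k, v, dropout_p=0.0, softmax_scale=1.0, causal=True)
print(out.shape)  # torch.Size([2, 3, 4, 64])
```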

The error is before that, but it seems it's torch nightly: the `transformers` snippet works with torch 2.2! (vs getting the `FlashAttention only support fp16 and bf16 data type` error with...
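
For context, the kind of `transformers` snippet I mean is roughly the following, a sketch assuming a Llama checkpoint with flash attention enabled (model id and prompt are illustrative):

```python
# Sketch only: model id and prompt are illustrative, assumes flash-attn is installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,              # flash attention only supports fp16/bf16
    attn_implementation="flash_attention_2",
    device_map="auto",
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```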

I am actually interested in deep diving a bit into the potential reasons why we are slower, and updating our implementation based on this as long as we don't break, and...
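
To give an idea of what that deep dive would look like, here is a rough timing sketch (model id, sequence length, and the set of backends are illustrative) comparing attention implementations with `torch.utils.benchmark`:

```python
# Rough sketch: model id and shapes are illustrative, assumes a CUDA device.
import torch
from torch.utils import benchmark
from transformers import AutoModelForCausalLM

torch.set_grad_enabled(False)
input_ids = torch.randint(0, 32000, (1, 512), device="cuda")

for impl in ("eager", "sdpa", "flash_attention_2"):
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        torch_dtype=torch.bfloat16,
        attn_implementation=impl,
    ).to("cuda")
    timer = benchmark.Timer(
        stmt="model(input_ids)",
        globals={"model": model, "input_ids": input_ids},
    )
    print(impl, timer.timeit(10))
    del model
    torch.cuda.empty_cache()
```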

If the `self.processor.tokenizer.bos_token_id` is correctly set (it should not be used, in the sense that if `forced_decoder_ids` is set it will be taken instead of this...
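
A minimal sketch of what I mean, assuming a Whisper-style checkpoint (the model id is illustrative); generation starts from `decoder_start_token_id` / `forced_decoder_ids`, not from the tokenizer's BOS:

```python
# Sketch only: model id is illustrative; attribute names follow the Whisper setup.
from transformers import AutoProcessor, WhisperForConditionalGeneration

processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

print(processor.tokenizer.bos_token_id)                # set, but not what generation starts from
print(model.generation_config.decoder_start_token_id)  # actual start token
print(model.generation_config.forced_decoder_ids)      # takes precedence when set
```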