Arthur


I am getting a similar issue, without training, with torch nightly on Llama, so I can confirm something's wrong! It might be on our side, but as far as I tested, all...

```python
>>> from flash_attn import flash_attn_func
>>> import torch
>>> print(torch.__version__)
2.3.0.dev20240208+cu121
>>> flash_attn_func(torch.ones((2,3), dtype=torch.bfloat16), torch.ones((2,3), dtype=torch.bfloat16), torch.ones((2,3), dtype=torch.bfloat16), 1, softmax_scale=1, causal=True)
```
....
```
File ~/miniconda3/envs/py310/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py:51, in _flash_attn_forward(q, k,...
```
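
For reference, a minimal sketch of a call that matches the layout `flash_attn_func` expects, i.e. (batch, seqlen, nheads, headdim) CUDA tensors in fp16/bf16; the shapes here are purely illustrative:

```python
# Sketch only: illustrative shapes, assumes a CUDA device and flash-attn installed.
import torch
from flash_attn import flash_attn_func

# (batch, seqlen, nheads, headdim) tensors in bf16 on the GPU
q = torch.ones((2, 3, 4, 64), dtype=torch.bfloat16, device="cuda")
k = torch.ones((2, 3, 4, 64), dtype=torch.bfloat16, device="cuda")
v = torch.ones((2, 3, 4, 64), dtype=torch.bfloat16, device="cuda")

out = flash_attn_func(q, k, v, dropout_p=0.0, softmax_scale=1.0, causal=True)
print(out.shape)  # torch.Size([2, 3, 4, 64])
```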

The error is before that, but it seems it's torch nightly: the `transformers` snippet works with torch 2.2! (vs getting the `FlashAttention only support fp16 and bf16 data type` error with...
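
For context, the kind of `transformers` snippet I mean is roughly the following, a sketch assuming a Llama checkpoint with flash attention enabled (model id and prompt are illustrative):

```python
# Sketch only: model id and prompt are illustrative, assumes flash-attn is installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,              # flash attention only supports fp16/bf16
    attn_implementation="flash_attention_2",
    device_map="auto",
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```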

I am actually interested in deep diving a bit into the potential reasons why we are slower, and updating our implementation based on this as long as we don't break, and...
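
To give an idea of what that deep dive would look like, here is a rough timing sketch (model id, sequence length, and the set of backends are illustrative) comparing attention implementations with `torch.utils.benchmark`:

```python
# Rough sketch: model id and shapes are illustrative, assumes a CUDA device.
import torch
from torch.utils import benchmark
from transformers import AutoModelForCausalLM

torch.set_grad_enabled(False)
input_ids = torch.randint(0, 32000, (1, 512), device="cuda")

for impl in ("eager", "sdpa", "flash_attention_2"):
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-2-7b-hf",
        torch_dtype=torch.bfloat16,
        attn_implementation=impl,
    ).to("cuda")
    timer = benchmark.Timer(
        stmt="model(input_ids)",
        globals={"model": model, "input_ids": input_ids},
    )
    print(impl, timer.timeit(10))
    del model
    torch.cuda.empty_cache()
```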

If the `self.processor.tokenizer.bos_token_id` is correctly set (it should not be used, in the sense that if `forced_decoder_ids` is set it will be taken instead of this...
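
A minimal sketch of what I mean, assuming a Whisper-style checkpoint (the model id is illustrative); generation starts from `decoder_start_token_id` / `forced_decoder_ids`, not from the tokenizer's BOS:

```python
# Sketch only: model id is illustrative; attribute names follow the Whisper setup.
from transformers import AutoProcessor, WhisperForConditionalGeneration

processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

print(processor.tokenizer.bos_token_id)                # set, but not what generation starts from
print(model.generation_config.decoder_start_token_id)  # actual start token
print(model.generation_config.forced_decoder_ids)      # takes precedence when set
```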