CPU support?
Can the model run on CPU? I'm getting this error (on a MacBook M1 Max):
File "/Users/remybarranco/Library/Caches/pypoetry/virtualenvs/starvector-pPvMD9yO-py3.12/lib/python3.12/site-packages/transformers/modeling_utils.py", line 1710, in _check_and_enable_flash_attn_2
raise ImportError(f"{preface} the package flash_attn seems to be not installed. {install_message}")
ImportError: FlashAttention2 has been toggled on, but it cannot be used due to the following error: the package flash_attn seems to be not installed. Please refer to the documentation of https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 to install Flash Attention 2.
I do not recommend running on CPU. It will be very slow.
There will be many people who want to run this on CPU no matter how slow it is. Support for Apple M-series CPUs would be great. Can you give any guidance on how to make this work (even if it's slow)?
Thank you
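For anyone who wants to try anyway: the traceback above comes from the checkpoint requesting FlashAttention-2 at load time, which only runs on CUDA GPUs. Here is a minimal sketch of the kind of override that usually works with Hugging Face transformers; the model id is a placeholder, and I haven't verified this against this repo's custom loading code:

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder id: substitute the actual StarVector checkpoint you are using.
MODEL_ID = "starvector/starvector-1b-im2svg"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    attn_implementation="eager",  # plain PyTorch attention; never imports flash_attn
    torch_dtype=torch.float32,    # CPU inference is most reliable in float32
    device_map="cpu",
    trust_remote_code=True,       # assumed: the repo ships custom modeling code
)
model.eval()
```

On Apple Silicon you could also try `device_map={"": "mps"}` to use the Metal backend instead of the CPU, though there is no guarantee that every op in the model is implemented for MPS.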
I got it to run in my repo, but my Mac M1 only has 32 GB of RAM, and when I set max_length to 1000 it did not get very far with the _hf demo. I have a branch called MacM1 if anyone is interested.
FYI: @joanrod is correct that this is a beast :) Even when it worked, it took about 2 minutes (and 39 GB of RAM) to complete the small demo file. Someone with a 64 GB Mac might have better luck.
I highly recommend using the vLLM backend; please check the instructions in the repository. It's more than ten times faster. I haven't tested it on CPU yet, but it should still be faster there too. We could also explore quantization techniques to improve performance further. @leonletto @Phaired
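For anyone trying the vLLM route, here is a rough sketch of vLLM's offline-inference API. The model id and prompt are placeholders (StarVector is an image-to-SVG model, so real usage should follow the repo's own vLLM instructions), and I haven't verified the CPU backend, which requires a separate vLLM build:

```python
from vllm import LLM, SamplingParams

# Placeholder id: use the checkpoint named in the repo's vLLM instructions.
llm = LLM(model="starvector/starvector-1b-im2svg", trust_remote_code=True)

# Cap max_tokens to keep memory use bounded, as discussed above.
params = SamplingParams(temperature=0.0, max_tokens=1000)

outputs = llm.generate(["<prompt for the image-to-SVG task>"], params)
print(outputs[0].outputs[0].text)
```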
OK, vLLM is a good workaround, even though it's a bit painful to set up. Thanks!