Minh-Thuc
Hello @ebraraktas, sorry for the late response. I think you should add the ``RUY`` implementation after ``BLAS``. The ``RUY`` implementation should be used only if ``BLAS`` is not available.
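For illustration, a minimal sketch of the fallback order I have in mind (the function and flags here are hypothetical, not CTranslate2's actual dispatch code):

```python
# Hypothetical sketch of the intended backend fallback order:
# prefer BLAS when it is available, and use RUY only as a fallback.
def select_gemm_backend(blas_available: bool, ruy_available: bool) -> str:
    if blas_available:
        return "BLAS"
    if ruy_available:
        return "RUY"
    raise RuntimeError("No GEMM backend available")

assert select_gemm_backend(True, True) == "BLAS"   # BLAS wins when present
assert select_gemm_backend(False, True) == "RUY"   # RUY only without BLAS
```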
Thanks. I relaunched the failed CI. I'll merge it then.
I just pushed PR #1599 to support tensor parallelism. This helps split a model across multiple GPUs. I tested this feature with some models like Llama 2, translators...
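For context, a minimal sketch of how I expect the feature to be used from the Python API (the model path is a placeholder, the ``tensor_parallel`` flag comes with the PR, and the script is meant to be launched under MPI, e.g. ``mpirun -np 2 python run_tp.py``):

```python
import ctranslate2

# Sketch: load a converted model with tensor parallelism enabled,
# splitting its weights across the visible GPUs.
generator = ctranslate2.Generator(
    "llama2-ct2",            # placeholder path to a converted model
    device="cuda",
    tensor_parallel=True,    # split the model across multiple GPUs
)

results = generator.generate_batch([["<s>", "▁Hello"]], max_length=32)
print(results[0].sequences[0])
```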
I updated the comment above for two cases: batch_size = 1 and batch_size = 5.
I'll close this issue as the feature is now supported. If you have any problems, feel free to open a new issue.
I closed this issue as completed. Tensor parallelism is now supported in CTranslate2.
Can you provide more detail on how you ran the converter?
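For instance, is it something like the following run via the Python API? (The model name and output directory here are placeholders:)

```python
import ctranslate2.converters

# Sketch of a typical converter run via the Python API.
converter = ctranslate2.converters.TransformersConverter(
    "meta-llama/Llama-2-7b-hf"   # placeholder: source checkpoint on the HF Hub
)
converter.convert(
    "llama2-ct2",                # placeholder: output directory
    quantization="int8",         # optional weight quantization
)
```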
If you specify target_prefix, the prefix is decoded in a single step, and the remaining tokens are then generated one by one in the next steps. Without target_prefix, it generates tokens one by one from the start...
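A minimal sketch of the two modes with the Python API (model path and tokens are placeholders):

```python
import ctranslate2

translator = ctranslate2.Translator("ende-ct2", device="cpu")  # placeholder model

source = [["▁Hello", "▁world", "</s>"]]

# With target_prefix: the prefix tokens are decoded in one forward step,
# then the remaining tokens are generated one by one.
with_prefix = translator.translate_batch(source, target_prefix=[["▁Hallo"]])

# Without target_prefix: every target token is generated one by one.
without_prefix = translator.translate_batch(source)

print(with_prefix[0].hypotheses[0])
print(without_prefix[0].hypotheses[0])
```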
Hello, what is the average seq_length in your benchmark? Flash attention only gives better performance for long prompts.
I mean the number of input tokens. It would be great to compare with and without FA2 for prompt sizes from 1000 to 3000 tokens. I think the prompt...
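For example, a rough benchmark sketch along these lines (assuming the ``flash_attention`` option on the Generator; the model path and tokens are placeholders):

```python
import time
import ctranslate2

def benchmark(flash_attention: bool, prompt_len: int) -> float:
    # Sketch: time generation over a synthetic long prompt.
    generator = ctranslate2.Generator(
        "llama2-ct2",                    # placeholder converted model
        device="cuda",
        flash_attention=flash_attention, # toggle FA2 on or off
    )
    prompt = [["▁token"] * prompt_len]   # synthetic prompt of prompt_len tokens
    start = time.perf_counter()
    generator.generate_batch(prompt, max_length=64)
    return time.perf_counter() - start

for prompt_len in (1000, 2000, 3000):
    t_fa = benchmark(True, prompt_len)
    t_base = benchmark(False, prompt_len)
    print(f"{prompt_len} tokens: FA2 {t_fa:.2f}s vs baseline {t_base:.2f}s")
```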