
FP8 inference and FP8 KV cache

SinanAkkoyun opened this issue on May 22, 2023

Feature request

Hi! Could anyone please help me with using HuggingFace models (LLaMA, or if LLaMA is difficult, MPT-7B) with TransformerEngine (TE) FP8 inference? We really need the speedup.

A somewhat related issue: https://github.com/NVIDIA/TransformerEngine/issues/199
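For concreteness, here is a minimal sketch of what this could look like, assuming an FP8-capable GPU (e.g. H100) and the `transformer-engine` package: `nn.Linear` layers are swapped for `te.Linear` and the forward pass runs under `te.fp8_autocast`. The `swap_te_linears` helper and the model id below are illustrative, not an existing transformers or TE API.

```python
# Minimal sketch, not an official transformers or TransformerEngine API.
# Assumes: an FP8-capable GPU (e.g. H100), transformer-engine installed, and a
# model whose hidden dimensions satisfy TE's FP8 GEMM shape constraints
# (dimensions divisible by 16). swap_te_linears is a hypothetical helper.
import torch
import torch.nn as nn
import transformer_engine.pytorch as te
from transformer_engine.common import recipe
from transformers import AutoModelForCausalLM, AutoTokenizer

def swap_te_linears(module: nn.Module) -> None:
    """Recursively replace nn.Linear layers with te.Linear, copying weights."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            te_linear = te.Linear(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                params_dtype=child.weight.dtype,
            )
            with torch.no_grad():
                te_linear.weight.copy_(child.weight)
                if child.bias is not None:
                    te_linear.bias.copy_(child.bias)
            setattr(module, name, te_linear)
        else:
            swap_te_linears(child)

model_id = "mosaicml/mpt-7b"  # example model; a LLaMA checkpoint works the same way
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)
swap_te_linears(model)
model.cuda().eval()

# An E4M3 forward-only recipe is enough for inference (no backward pass).
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.E4M3)

# TE's FP8 GEMMs require the token count to be divisible by 8, so pad the input.
inputs = tokenizer(
    "FP8 inference test", return_tensors="pt", padding="max_length", max_length=16
).to("cuda")
with torch.no_grad(), te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    logits = model(**inputs).logits
```

Note that this only moves the linear GEMMs to FP8; attention kernels and the KV cache still run in bf16, so the FP8 KV cache part of this request would need changes inside the model's attention implementation. Autoregressive `model.generate` with a growing cache is also not covered by this sketch.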

Motivation

Faster inference and more specialized tensor operations mean lower cost and lower latency.

Your contribution

I would really love to test suggestions out, as I have temporary access to an H100 cloud GPU. I am not proficient enough to port the models myself, which is why I created this issue.

I really appreciate any help. Thank you very much.

SinanAkkoyun · May 22, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Jun 22, 2023

@SinanAkkoyun have you found a solution for how to use TransformerEngine with LLaMA?

AhsanAli1288 · Aug 23, 2023

Any updates?

maxpain · Sep 06, 2023

Gentle ping @fxmarty

amyeroberts · Jun 28, 2024

Another ping @fxmarty. Could you nominate someone to take this over for you?

amyeroberts · Jul 23, 2024

cc @IlyasMoutawwakil

amyeroberts · Aug 19, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Oct 11, 2024