
FP8 inference and FP8 KV cache

SinanAkkoyun opened this issue on May 22, 2023

Feature request

Hi! Could anyone please help me with using HuggingFace models (LLaMA, or if LLaMA is difficult, MPT-7B) with TransformerEngine (TE) FP8 inference? We really need the speedup.

A somewhat related issue: https://github.com/NVIDIA/TransformerEngine/issues/199
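For concreteness, here is a minimal sketch of what this could look like, assuming an FP8-capable GPU (e.g. H100) and the `transformer-engine` package: `nn.Linear` layers are swapped for `te.Linear` and the forward pass runs under `te.fp8_autocast`. The `swap_te_linears` helper and the model id below are illustrative, not an existing transformers or TE API.

```python
# Minimal sketch, not an official transformers or TransformerEngine API.
# Assumes: an FP8-capable GPU (e.g. H100), transformer-engine installed, and a
# model whose hidden dimensions satisfy TE's FP8 GEMM shape constraints
# (dimensions divisible by 16). swap_te_linears is a hypothetical helper.
import torch
import torch.nn as nn
import transformer_engine.pytorch as te
from transformer_engine.common import recipe
from transformers import AutoModelForCausalLM, AutoTokenizer

def swap_te_linears(module: nn.Module) -> None:
    """Recursively replace nn.Linear layers with te.Linear, copying weights."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            te_linear = te.Linear(
                child.in_features,
                child.out_features,
                bias=child.bias is not None,
                params_dtype=child.weight.dtype,
            )
            with torch.no_grad():
                te_linear.weight.copy_(child.weight)
                if child.bias is not None:
                    te_linear.bias.copy_(child.bias)
            setattr(module, name, te_linear)
        else:
            swap_te_linears(child)

model_id = "mosaicml/mpt-7b"  # example model; a LLaMA checkpoint works the same way
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)
swap_te_linears(model)
model.cuda().eval()

# An E4M3 forward-only recipe is enough for inference (no backward pass).
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.E4M3)

# TE's FP8 GEMMs require the token count to be divisible by 8, so pad the input.
inputs = tokenizer(
    "FP8 inference test", return_tensors="pt", padding="max_length", max_length=16
).to("cuda")
with torch.no_grad(), te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    logits = model(**inputs).logits
```

Note that this only moves the linear GEMMs to FP8; attention kernels and the KV cache still run in bf16, so the FP8 KV cache part of this request would need changes inside the model's attention implementation. Autoregressive `model.generate` with a growing cache is also not covered by this sketch.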

Motivation

Faster inference and more specialized tensor operations mean lower cost and lower latency.

Your contribution

I would really love to test suggestions out, as I have temporary access to an H100 cloud GPU. I am not proficient enough to port the models myself, which is why I created this issue.

I really appreciate any help. Thank you very much.

SinanAkkoyun · May 22, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Jun 22, 2023

@SinanAkkoyun have you found a solution for how to use TransformerEngine with LLaMA?

AhsanAli1288 · Aug 23, 2023

Any updates?

maxpain · Sep 06, 2023

Gentle ping @fxmarty

amyeroberts · Jun 28, 2024

Another ping @fxmarty. Could you nominate someone to take this over for you?

amyeroberts · Jul 23, 2024

cc @IlyasMoutawwakil

amyeroberts · Aug 19, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Oct 11, 2024