Note: we keep the AMD-optimal kernel for embeddings as a separate module (fbgemm_gpu/hip_kernel) to facilitate:
- vendor innovation, with a degree of freedom to drop in the very best and very...
My concern about this PR is that it will incur performance, compatibility, and interop issues when compared with FP8 serving/inference solutions from NVIDIA, AMD, etc. None of them is...
@zhaoyang-star Good that you noticed my concern. IMO I tend to reject the idea of using E5M2 without scaling (from/to wider-precision numbers: fp16/fp32/etc.), but I will try to...
> @HaiShaw Your concern is very important to the feature. I think we could consider adding a scaling factor in the next PR. I insist on using e5m2; e4m3 will lead...
> @zhaoyang-star Is there a precision problem converting from bfloat16 to fp8? Because the exponent width is not the same.

@Shawn314 The CUDA intrinsic alone would just cast wider-precision numbers down to FP8...
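To make the casting-vs-scaling tradeoff concrete, here is a small standalone sketch (my own, not code from this PR) that round-trips a value through FP8 using the conversion intrinsics in `<cuda_fp8.h>`; the function name and the sample values are purely illustrative.

```cpp
// Standalone sketch (not from this PR): quantize a float to FP8 and back,
// with an optional scaling factor. Requires CUDA >= 11.8; build with nvcc.
#include <cuda_fp8.h>
#include <cuda_fp16.h>
#include <cstdio>

// scale == 1.0f corresponds to the "cast only, no scaling" path discussed above.
float fp8_roundtrip(float x, float scale, __nv_fp8_interpretation_t fmt) {
  // Quantize: divide by the scale, then saturate-cast to the chosen FP8 format.
  __nv_fp8_storage_t q = __nv_cvt_float_to_fp8(x / scale, __NV_SATFINITE, fmt);
  // Dequantize: FP8 -> half -> float, then multiply the scale back in.
  __half_raw hr = __nv_cvt_fp8_to_halfraw(q, fmt);
  return __half2float(__half(hr)) * scale;
}

int main() {
  const float x = 512.0f;  // larger than E4M3's max normal (448)
  printf("E5M2, scale 1: %g\n", fp8_roundtrip(x, 1.0f, __NV_E5M2));  // 512, exact
  printf("E4M3, scale 1: %g\n", fp8_roundtrip(x, 1.0f, __NV_E4M3));  // saturates to 448
  printf("E4M3, scale 2: %g\n", fp8_roundtrip(x, 2.0f, __NV_E4M3));  // 512 again
  return 0;
}
```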
> @zhaoyang-star @HaiShaw Thanks for your explanation, very clear!

@Shawn314 @zhaoyang-star I just opened an FP8 discussion below; comments are welcome! https://github.com/vllm-project/vllm/discussions/2461
w.r.t. `We cannot enable e4m3 and e5m2 in the same build.` If we look to have a build with both supported on the same newer hardware, most likely we won't...
w.r.t. `When running FP8 model, we load kv-cache scaling factor from the model checkpoint.` We should have a serialized checkpoint with various scaling factors defined, for both the stationary scaling factors...
```
A wrapper function convert that converts Tin to Tout with a particular FP8 format.
For example, when writing values to kv-cache, Tin=uint16_t, Tout=uint8_t, kv_dt=kFp8E4M3
```

Over time, it makes...
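For illustration, here is a rough sketch of what such a wrapper could look like; the name `scaled_convert`, the extra `scale` parameter, and the use of `<cuda_fp8.h>` intrinsics are my assumptions, not necessarily the PR's actual signature.

```cpp
// Rough sketch of a scaled Tin -> Tout conversion wrapper (illustrative only,
// not the PR's exact interface). Requires CUDA >= 11.8; build with nvcc.
#include <cuda_fp8.h>
#include <cuda_fp16.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Primary template: each (Tout, Tin) pair in use gets its own specialization.
template <typename Tout, typename Tin>
Tout scaled_convert(Tin x, float scale, __nv_fp8_interpretation_t fp8_fmt);

// The case called out above: writing to the kv-cache, Tin = uint16_t (raw half
// bits) and Tout = uint8_t (one FP8 byte), dividing by the layer's scale.
template <>
uint8_t scaled_convert<uint8_t, uint16_t>(uint16_t x, float scale,
                                          __nv_fp8_interpretation_t fp8_fmt) {
  __half h;
  std::memcpy(&h, &x, sizeof(h));  // reinterpret the raw bits as a half
  const float f = __half2float(h);
  return static_cast<uint8_t>(
      __nv_cvt_float_to_fp8(f / scale, __NV_SATFINITE, fp8_fmt));
}

int main() {
  __half h = __float2half(3.25f);
  uint16_t bits;
  std::memcpy(&bits, &h, sizeof(bits));
  // kv_dt = E4M3 here; passing __NV_E5M2 selects the other format instead.
  printf("fp8 byte: 0x%02x\n",
         scaled_convert<uint8_t, uint16_t>(bits, 1.0f, __NV_E4M3));
  return 0;
}
```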
> @AdrianAbeyta maybe it would be nicer UX if the `kv_cache_scales.json` were packaged with the model and the path to it was referenced in the model's `config.json`? Then it could...
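As a sketch of that packaging idea (the `kv_cache_scales_path` field, the JSON layout, and the use of the nlohmann/json library are all my assumptions for illustration), the loader side could look roughly like this:

```cpp
// Hypothetical loader: config.json points at kv_cache_scales.json, which holds
// one scaling factor per layer. Assumes the nlohmann/json library is available.
#include <nlohmann/json.hpp>
#include <fstream>
#include <string>
#include <vector>

std::vector<float> load_kv_cache_scales(const std::string& model_dir) {
  std::ifstream cfg_file(model_dir + "/config.json");
  const nlohmann::json cfg = nlohmann::json::parse(cfg_file);

  // No scales packaged with the model: fall back to "no scaling" (empty list).
  if (!cfg.contains("kv_cache_scales_path")) {
    return {};
  }

  std::ifstream scales_file(model_dir + "/" +
                            cfg.at("kv_cache_scales_path").get<std::string>());
  const nlohmann::json scales = nlohmann::json::parse(scales_file);

  // One stationary scaling factor per decoder layer.
  std::vector<float> out;
  for (const auto& s : scales.at("kv_cache_scaling_factors")) {
    out.push_back(s.get<float>());
  }
  return out;
}
```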