Note: we keep the AMD-optimal kernel for embeddings as a separate module (fbgemm_gpu/hip_kernel) to facilitate:
- vendor innovation, with a degree of freedom to drop in the very best and very...
My concern about this PR is that it will incur performance, compatibility, and interop issues when compared with FP8 serving/inference solutions from NVIDIA, AMD, etc. None of them is...
@zhaoyang-star Good that you noticed my concern. IMO I tend to reject the idea of using E5M2 without scaling (from/to wider-precision numbers: fp16/fp32/etc.), but I will try to...
> @HaiShaw Your concern is very important to the feature. I think we could consider adding a scaling factor in the next PR. I insist on using e5m2; e4m3 will lead...
> @zhaoyang-star Is there a precision problem converting from bfloat16 to fp8? Because the exponent width is not the same.

@Shawn314 The CUDA intrinsic alone would just cast wider-precision numbers down to FP8...
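To make the casting-vs-scaling tradeoff concrete, here is a small standalone sketch (my own, not code from this PR) that round-trips a value through FP8 using the conversion intrinsics in `<cuda_fp8.h>`; the function name and the sample values are purely illustrative.

```cpp
// Standalone sketch (not from this PR): quantize a float to FP8 and back,
// with an optional scaling factor. Requires CUDA >= 11.8; build with nvcc.
#include <cuda_fp8.h>
#include <cuda_fp16.h>
#include <cstdio>

// scale == 1.0f corresponds to the "cast only, no scaling" path discussed above.
float fp8_roundtrip(float x, float scale, __nv_fp8_interpretation_t fmt) {
  // Quantize: divide by the scale, then saturate-cast to the chosen FP8 format.
  __nv_fp8_storage_t q = __nv_cvt_float_to_fp8(x / scale, __NV_SATFINITE, fmt);
  // Dequantize: FP8 -> half -> float, then multiply the scale back in.
  __half_raw hr = __nv_cvt_fp8_to_halfraw(q, fmt);
  return __half2float(__half(hr)) * scale;
}

int main() {
  const float x = 512.0f;  // larger than E4M3's max normal (448)
  printf("E5M2, scale 1: %g\n", fp8_roundtrip(x, 1.0f, __NV_E5M2));  // 512, exact
  printf("E4M3, scale 1: %g\n", fp8_roundtrip(x, 1.0f, __NV_E4M3));  // saturates to 448
  printf("E4M3, scale 2: %g\n", fp8_roundtrip(x, 2.0f, __NV_E4M3));  // 512 again
  return 0;
}
```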
> @zhaoyang-star @HaiShaw Thanks for your explanation, very clear!

@Shawn314 @zhaoyang-star I just opened an FP8 discussion below; comments are welcome! https://github.com/vllm-project/vllm/discussions/2461
w.r.t. `We cannot enable e4m3 and e5m2 in the same build.` If we look to have a build with both supported on the same newer hardware, most likely we won't...
w.r.t. `When running FP8 model, we load kv-cache scaling factor from the model checkpoint.` We should have a serialized checkpoint with various scaling factors defined, for both the stationary scaling factors...
```
A wrapper function convert that converts Tin to Tout with a particular FP8 format.
For example, when writing values to kv-cache, Tin=uint16_t, Tout=uint8_t, kv_dt=kFp8E4M3
```

Over time, it makes...
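For illustration, here is a rough sketch of what such a wrapper could look like; the name `scaled_convert`, the extra `scale` parameter, and the use of `<cuda_fp8.h>` intrinsics are my assumptions, not necessarily the PR's actual signature.

```cpp
// Rough sketch of a scaled Tin -> Tout conversion wrapper (illustrative only,
// not the PR's exact interface). Requires CUDA >= 11.8; build with nvcc.
#include <cuda_fp8.h>
#include <cuda_fp16.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

// Primary template: each (Tout, Tin) pair in use gets its own specialization.
template <typename Tout, typename Tin>
Tout scaled_convert(Tin x, float scale, __nv_fp8_interpretation_t fp8_fmt);

// The case called out above: writing to the kv-cache, Tin = uint16_t (raw half
// bits) and Tout = uint8_t (one FP8 byte), dividing by the layer's scale.
template <>
uint8_t scaled_convert<uint8_t, uint16_t>(uint16_t x, float scale,
                                          __nv_fp8_interpretation_t fp8_fmt) {
  __half h;
  std::memcpy(&h, &x, sizeof(h));  // reinterpret the raw bits as a half
  const float f = __half2float(h);
  return static_cast<uint8_t>(
      __nv_cvt_float_to_fp8(f / scale, __NV_SATFINITE, fp8_fmt));
}

int main() {
  __half h = __float2half(3.25f);
  uint16_t bits;
  std::memcpy(&bits, &h, sizeof(bits));
  // kv_dt = E4M3 here; passing __NV_E5M2 selects the other format instead.
  printf("fp8 byte: 0x%02x\n",
         scaled_convert<uint8_t, uint16_t>(bits, 1.0f, __NV_E4M3));
  return 0;
}
```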
> @AdrianAbeyta maybe it would be nicer UX if the `kv_cache_scales.json` were packaged with the model and the path to it was referenced in the model's `config.json`? Then it could...
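As a sketch of that packaging idea (the `kv_cache_scales_path` field, the JSON layout, and the use of the nlohmann/json library are all my assumptions for illustration), the loader side could look roughly like this:

```cpp
// Hypothetical loader: config.json points at kv_cache_scales.json, which holds
// one scaling factor per layer. Assumes the nlohmann/json library is available.
#include <nlohmann/json.hpp>
#include <fstream>
#include <string>
#include <vector>

std::vector<float> load_kv_cache_scales(const std::string& model_dir) {
  std::ifstream cfg_file(model_dir + "/config.json");
  const nlohmann::json cfg = nlohmann::json::parse(cfg_file);

  // No scales packaged with the model: fall back to "no scaling" (empty list).
  if (!cfg.contains("kv_cache_scales_path")) {
    return {};
  }

  std::ifstream scales_file(model_dir + "/" +
                            cfg.at("kv_cache_scales_path").get<std::string>());
  const nlohmann::json scales = nlohmann::json::parse(scales_file);

  // One stationary scaling factor per decoder layer.
  std::vector<float> out;
  for (const auto& s : scales.at("kv_cache_scaling_factors")) {
    out.push_back(s.get<float>());
  }
  return out;
}
```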