Philipp Moritz comments

Results: 85 comments by Philipp Moritz

[RFC]: Refactor FP8 kv-cache

I'm +1 to supporting activation scales in the FP16 checkpoint and not in JSON. This way fewer configurations need to be supported and everything is uniform :)

[RFC]: Refactor FP8 kv-cache

> I think I've tried several ways but haven't had any luck so far (maybe I'm also missing something). I suppose the main reason is that kv quantization happens during kv-cache write...

[RFC]: Refactor FP8 kv-cache

Ah I see, in that case, `kAuto` is a good name since it is the same as "auto" in Python. I didn't realize it required a special code path :)

support for 3D?

Hey jingpengwu, thanks for your message! I'm mostly using Python these days, so I probably won't implement it myself at the moment, but if you are interested in helping, I'm...

support for 3D?

I'm glad to hear the project could be useful for your work! First you will need to update the version of Caffe that is included in Strada. It needs to...

Enable scaled FP8 (e4m3fn) KV cache on ROCm (AMD GPU)

Can the files `hip_float8.h` and `hip_float8_impl.h` be part of some AMD SDK going forward? They shouldn't be part of vLLM :)

[core] Add opt-in flag for Windows and OSX clusters, update `ray start` output to match docs

Thanks for doing this ❤️ Just a small nit: At the moment we have an unholy mix of sometimes `1` being true and sometimes `"true"` being true for environment variables...
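
To illustrate the nit above, here is a minimal sketch of one way to read boolean environment variables uniformly, so that both `1` and `true` are accepted. The helper name `env_flag` and the variable name `EXAMPLE_FLAG` are hypothetical and only for illustration; this is not Ray's actual implementation.

```python
import os

# Spellings accepted as true/false (compared case-insensitively).
_TRUE_VALUES = {"1", "true"}
_FALSE_VALUES = {"0", "false"}


def env_flag(name, default=False, environ=None):
    """Hypothetical helper: parse a boolean environment variable.

    Accepts both `1` and `true` (any case) as true, and `0` / `false`
    as false. Unset or empty variables fall back to `default`.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    value = raw.strip().lower()
    if value in _TRUE_VALUES:
        return True
    if value in _FALSE_VALUES:
        return False
    # Fail loudly instead of guessing what an unexpected value means.
    raise ValueError(f"Invalid boolean value for {name}: {raw!r}")


# Usage: both spellings behave the same way.
print(env_flag("EXAMPLE_FLAG", environ={"EXAMPLE_FLAG": "1"}))
print(env_flag("EXAMPLE_FLAG", environ={"EXAMPLE_FLAG": "true"}))
```

Routing every flag through a single helper like this would mean the accepted spellings are defined in one place rather than per variable.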

Use setup.py develop for installation

Richard pointed out to me that there is `python setup.py develop`, which does what we want for development (i.e. you don't need to re-run `python setup.py` if you edit the...

[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support

It is a bummer that GitHub doesn't render the diff between the old and new NVIDIA quant_utils.cuh -- for ease of reviewing, here is the diff: ```diff (base) pcmoritz@pcmoritz-DQ44HV60WX /tmp...

[Kernel] Refactor FP8 kv-cache with NVIDIA float8_e4m3 support

Did you investigate the performance impact of passing `__nv_fp8_interpretation_t` around at runtime? Have you considered making the format a template parameter of the `vec_conversion` and related functions (e.g. by reusing...