fxmarty

333 comments of fxmarty

Keras CI looks broken.

@almersawi it looks in good shape to me. cc @OlivierDehaene

Compared to calling the kernel in isolation, in my config the end-to-end benchmark is slower when using quanto, due to many overheads. ![image](https://github.com/user-attachments/assets/e7afc0ce-9d72-480a-b54a-9482e5a2c3ed) A separate PR may be needed to...
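
For context, a minimal sketch of the kind of end-to-end decode-latency measurement I mean, assuming the optimum-quanto `quantize`/`freeze` API and reusing the TinyLlama model from the commands below; in such a loop the kernel-level gain can be hidden by dispatch and other framework overheads:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.quanto import quantize, freeze, qfloat8  # assuming the optimum-quanto API

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# Quantize weights to float8 and freeze them.
quantize(model, weights=qfloat8)
freeze(model)

inputs = tokenizer("Hello", return_tensors="pt").to("cuda")

# End-to-end decode latency: the per-token time includes dispatch,
# dequantization and Python/framework overheads, not just the matmul kernel.
torch.cuda.synchronize()
start = time.perf_counter()
model.generate(**inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()
print(f"{(time.perf_counter() - start) / 128 * 1000:.2f} ms/token")
```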

For sure. Last week I was using the quanto benchmark script with

```bash
python evaluate_model.py --device cuda --metric decode-latency --quantizer quanto --weights float8_e4m3fn --activations none --dtype fp16 --batch_size 1
python evaluate_model.py...
```

@dacorvo Running `python evaluate_model.py --device cuda --metric decode-latency --quantizer quanto --weights float8_e4m3fn --activations none --dtype fp16 --batch_size 1 --model TinyLlama/TinyLlama-1.1B-Chat-v1.0` on my laptop between https://github.com/huggingface/optimum-quanto/pull/241/commits/b8dbdf0f6ada8d08076dac974832825436f79256 and https://github.com/huggingface/optimum-quanto/pull/241/commits/d52e44a9c99ffc922e4bdda266100132a6348780 (avoid dispatch when...

> the restriction to per-tensor scale makes the kernel unusable

I think this can be easily changed in a later PR. Neither vLLM nor TGI support per-column scales, and yet...
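
To illustrate the distinction being discussed, a small sketch in plain PyTorch (not the kernel's actual code) of per-tensor versus per-column float8 scales for a weight matrix:

```python
import torch

w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
f8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn

# Per-tensor scale: a single scalar for the whole weight matrix,
# which is the case the kernel handles today.
scale = w.abs().max().float() / f8_max
w_f8 = (w / scale).to(torch.float8_e4m3fn)

# Per-column scales: one scalar per column; finer-grained, but the kernel
# would need to broadcast a vector of scales instead of a scalar.
scales = w.abs().amax(dim=0, keepdim=True).float() / f8_max
w_f8_cols = (w / scales).to(torch.float8_e4m3fn)
```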

In line with my tests in TGI, we get a decent speedup with this kernel only when using cudagraphs. I can't really explain why. Using transformers + A100 + 8B...
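
For reference, the mechanism I am referring to, as a rough sketch with a toy layer standing in for the real decode step (not TGI's actual integration): capturing the step in a CUDA graph replaces many small kernel launches with a single graph replay, which is where the launch/dispatch overhead around small fp8 kernels goes away.

```python
import torch

# Toy stand-in for one decoder forward pass; the real case is a full
# transformer decode step on fixed-shape static tensors.
layer = torch.nn.Linear(4096, 4096, dtype=torch.float16, device="cuda")
static_input = torch.randn(1, 4096, dtype=torch.float16, device="cuda")

# Warm up on a side stream before capture, as PyTorch recommends.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        _ = layer(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture the step once into a CUDA graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = layer(static_input)

# At each decode step, copy new data into the static buffer and replay:
# one graph launch instead of many individual kernel launches.
static_input.copy_(torch.randn_like(static_input))
graph.replay()
```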

Hi @brainwo, thanks a lot for giving it a try! It's been a while since I last updated the repo; I'll give it a shot. I think what happens here is...

@kilianyp Thank you. No, it is not an option yet. Yes, it should happen, though I have no timeline. This would be very useful.

If you would like to contribute this, happy to help review!