candle Clean the duplicated processor in the quantized example

Open clearloop opened this issue 10 months ago • 2 comments

Hi team! Thanks for the awesome work bringing Rust to the game!

I found that the quantized example does not use LogitsProcessor properly: the chat loop repeats work unnecessarily, and performance suffers as a result. People trying out the example may think the slowdown is caused by candle itself (e.g. that candle performs badly compared with llama.cpp XD; I thought so too before reviewing the code carefully).

Since we are already using a loop here

https://github.com/huggingface/candle/blob/236c35e5789723efe772f41920f3ac071bdff24d/candle-examples/examples/quantized/main.rs#L508

we don't have to run inference on all tokens again here

https://github.com/huggingface/candle/blob/236c35e5789723efe772f41920f3ac071bdff24d/candle-examples/examples/quantized/main.rs#L575

Instead, we can move the LogitsProcessor out of the global interactive loop and keep an extra cache of tokens that includes the user's prompts

https://github.com/huggingface/candle/blob/236c35e5789723efe772f41920f3ac071bdff24d/candle-examples/examples/quantized/main.rs#L559
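To illustrate the idea, here is a minimal sketch that does not use candle at all. `ToyModel`, its `forward(tokens, index_pos)` method, and `run_chat` are hypothetical stand-ins I made up for this issue, not the example's real code. It only counts how many tokens go through the forward pass when the token history and position are kept across turns, versus when everything is fed again from position 0 on each prompt.

```rust
/// Hypothetical stand-in for a model with a KV cache (NOT a candle type).
/// It only counts how many tokens are pushed through `forward`.
struct ToyModel {
    cached: usize,    // positions currently held in the pretend KV cache
    processed: usize, // total tokens processed over all forward calls
}

impl ToyModel {
    fn new() -> Self {
        Self { cached: 0, processed: 0 }
    }

    /// Pretend forward pass: `tokens` are fed starting at `index_pos`.
    /// Feeding from position 0 throws away the cache, as a restart would.
    fn forward(&mut self, tokens: &[u32], index_pos: usize) -> u32 {
        if index_pos == 0 {
            self.cached = 0;
        }
        // New tokens must continue exactly where the cache ends.
        assert_eq!(index_pos, self.cached);
        self.cached += tokens.len();
        self.processed += tokens.len();
        // Arbitrary fake "next token"; a real model would return logits.
        (self.cached as u32) % 7
    }
}

/// Runs a toy chat and returns the total number of tokens processed.
/// `incremental = true` keeps the history and position across turns;
/// `incremental = false` re-feeds the whole history on every prompt.
fn run_chat(prompts: &[Vec<u32>], gen_len: usize, incremental: bool) -> usize {
    let mut model = ToyModel::new();
    // State that lives OUTSIDE the interactive loop.
    let mut history: Vec<u32> = Vec::new();
    let mut index_pos = 0usize;

    for prompt in prompts {
        history.extend_from_slice(prompt);
        if !incremental {
            // Start over: every previous token is processed again.
            index_pos = 0;
        }
        for _ in 0..gen_len {
            // Feed only the tokens that are not in the cache yet.
            let next = model.forward(&history[index_pos..], index_pos);
            index_pos = history.len();
            history.push(next);
        }
    }
    model.processed
}
```

With three prompts of four tokens each and two generated tokens per turn, `run_chat(&prompts, 2, true)` processes 17 tokens, while `run_chat(&prompts, 2, false)` processes 33, and the gap keeps growing with every extra turn. In the real example the sampler state would live outside the loop in the same way as `history` and `index_pos` do here.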

This could be related to #1939, where the example is super slow from the second prompt onward.

Jan 07 '25 19:01 clearloop
