candle
Clean the duplicated processor in the quantized example
Hi team! Thanks for the awesome work bringing Rust to the game!
I found that `LogitsProcessor` is not used properly in the quantized example, which makes the chat loop burn resources unnecessarily and results in bad performance. People trying out the example may think this is caused by candle itself (like "the performance of candle sucks compared with llama.cpp" XD, I did think so before reviewing the code carefully).
Since we are already using a loop here:
https://github.com/huggingface/candle/blob/236c35e5789723efe772f41920f3ac071bdff24d/candle-examples/examples/quantized/main.rs#L508
we don't have to run inference over all the tokens again here:
https://github.com/huggingface/candle/blob/236c35e5789723efe772f41920f3ac071bdff24d/candle-examples/examples/quantized/main.rs#L575
Instead, we can move the `LogitsProcessor` out of the interactive loop and keep an extra cache of tokens that includes the user's prompts:
https://github.com/huggingface/candle/blob/236c35e5789723efe772f41920f3ac071bdff24d/candle-examples/examples/quantized/main.rs#L559
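
To make the idea concrete, here is a rough sketch of the loop shape I have in mind. It does not use candle at all: `ToyModel`, `ToySampler`, and `chat` are hypothetical stand-ins I made up for illustration (the toy model only counts how many tokens it has been fed, and the toy sampler is just a seeded generator), so please read it as the pattern, not as the actual patch.

```rust
/// Hypothetical stand-in for a model that keeps its own KV cache.
/// It only tracks how many tokens have been fed so far.
struct ToyModel {
    cached_len: usize,
}

impl ToyModel {
    /// Feed `tokens` starting at position `index_pos` and return a fake "logit".
    fn forward(&mut self, tokens: &[u32], index_pos: usize) -> u32 {
        // New tokens must continue exactly where the cache ends.
        assert_eq!(index_pos, self.cached_len);
        self.cached_len += tokens.len();
        (self.cached_len % 7) as u32
    }
}

/// Hypothetical stand-in for a seeded sampler (the role LogitsProcessor plays).
struct ToySampler {
    state: u64,
}

impl ToySampler {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    /// Turn the fake logit into a token id, advancing the seeded state.
    fn sample(&mut self, logit: u32) -> u32 {
        self.state = self
            .state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        logit + ((self.state >> 33) as u32 % 3)
    }
}

/// Run several chat turns, sampling `to_sample` tokens per turn.
/// Returns every token seen and how many tokens the model was fed.
fn chat(turns: &[Vec<u32>], to_sample: usize) -> (Vec<u32>, usize) {
    let mut model = ToyModel { cached_len: 0 };
    // Created once, outside the interactive loop.
    let mut sampler = ToySampler::new(299792458);
    // Cache of all tokens so far: user prompts plus generated tokens.
    let mut all_tokens: Vec<u32> = Vec::new();
    // Number of tokens from `all_tokens` already fed to the model.
    let mut fed = 0;

    for prompt in turns {
        all_tokens.extend_from_slice(prompt);
        for _ in 0..to_sample {
            // Feed only the tokens the model has not seen yet.
            let logit = model.forward(&all_tokens[fed..], fed);
            fed = all_tokens.len();
            let next = sampler.sample(logit);
            all_tokens.push(next);
        }
    }
    (all_tokens, model.cached_len)
}
```

With this shape, calling `chat` with two prompts feeds each token to the model at most once, so the second turn costs about as much as the first instead of replaying the whole history.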
This could be related to #1939, where the example is super slow from the second prompt onward.