Model reuse in TextGeneration examples

Open jondot opened this issue 1 year ago • 10 comments

Hi, I'd like to rig one of the examples into a service, where the service (HTTP) gets a prompt and runs TextGeneration. As it stands, TextGeneration wants to own the model and tokenizer, which means they need to be created from scratch on each request (time-consuming, and unacceptable for a per-request lifecycle). Any recommendations on how to do this?
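For illustration, here's roughly the shape I'm after (a sketch only; `Model`, `Tokenizer`, and `run_generation` are placeholders, not candle's actual types):

```rust
use std::sync::Arc;

// Placeholders for the loaded candle model and tokenizer.
struct Model;
struct Tokenizer;

// Loaded once at server startup, shared by every request handler.
struct AppState {
    model: Arc<Model>,
    tokenizer: Arc<Tokenizer>,
}

// Stand-in for the TextGeneration loop.
fn run_generation(_model: &Model, _tokenizer: &Tokenizer, prompt: &str) -> String {
    format!("completion for: {prompt}")
}

// Per-request handler: borrows the shared state instead of reloading.
// This is the part that doesn't work today, because TextGeneration
// wants to own the model and tokenizer.
fn handle_prompt(state: &AppState, prompt: &str) -> String {
    run_generation(&state.model, &state.tokenizer, prompt)
}
```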

jondot avatar Dec 09 '23 12:12 jondot

It looks like https://github.com/huggingface/candle/pull/1370 might solve this issue for the quantized version of llama. You could clear the cache after every request and then keep generating.
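A rough sketch of that pattern (the type and the `clear_kv_cache` method are stand-ins based on the linked PR, not necessarily the exact candle API):

```rust
// Stand-in for the quantized llama weights plus their KV cache.
struct QuantizedModel;

impl QuantizedModel {
    // Stand-in for a full prompt -> completion generation loop.
    fn generate(&mut self, prompt: &str) -> String {
        format!("completion for: {prompt}")
    }

    // Assumed from the linked PR: drops the attention cache so the
    // next request starts from a clean state.
    fn clear_kv_cache(&mut self) {}
}

// One long-lived model, reset between requests instead of reloaded.
fn serve_request(model: &mut QuantizedModel, prompt: &str) -> String {
    let output = model.generate(prompt);
    model.clear_kv_cache();
    output
}
```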

A different approach is separating the history of the session from the model entirely. Then, once you are done with the history, you can still reuse the tokenizer and model without resetting anything. That is what I do in Kalosm here. It lets you generate text without mutating the model's state, and separating the history also lets you serialize and deserialize it, which may be useful if you want to resume text generation quickly after a disconnect.
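In sketch form (illustrative types, not Kalosm's real API):

```rust
use std::sync::Arc;

// The model itself is shared and never mutated by generation.
struct Model;

// All per-user state lives in the session, which can be serialized
// and restored independently of the model.
#[derive(Default)]
struct Session {
    history: Vec<u32>, // token history (in practice also the KV cache)
}

fn generate(_model: &Arc<Model>, session: &mut Session, prompt: &[u32]) -> Vec<u32> {
    session.history.extend_from_slice(prompt);
    // ... feed only the new tokens to the model, reusing the session
    // state, and collect the sampled tokens ...
    let new_tokens: Vec<u32> = Vec::new();
    session.history.extend_from_slice(&new_tokens);
    new_tokens
}
```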

Edit: I also created a streaming text generation server that uses candle here.

ealmloff avatar Dec 10 '23 22:12 ealmloff

Here is another example to reference.

The model loads when the server starts so that multiple users can connect to the same instance.

I'm just passing a `&model` and then `.clone()`ing it.
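Something like this, in sketch form (the `Model` type is a stand-in for the loaded candle weights):

```rust
// Loaded once at startup; `Clone` gives each request its own copy of
// the mutable state (KV cache etc.) without rereading the weights.
#[derive(Clone)]
struct Model;

impl Model {
    fn generate(&mut self, prompt: &str) -> String {
        format!("completion for: {prompt}")
    }
}

// Per-request: clone the shared instance, which is cheap relative to
// loading the model from disk.
fn handle_request(model: &Model, prompt: &str) -> String {
    let mut model = model.clone();
    model.generate(prompt)
}
```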

danielclough avatar Dec 11 '23 07:12 danielclough

@danielclough thanks! @ealmloff thanks! Kalosm looks great, I'll try to use it directly. Looks like you use both llm-rs and candle. What's your impression?

jondot avatar Dec 12 '23 10:12 jondot

> Looks like you use both llm-rs and candle. What's your impression?

llm-rs is faster, but it supports fewer models and is less controllable. llm-rs only exposes code for basic text generation, so you cannot save a chat history cache or use constrained generation.

ealmloff avatar Dec 12 '23 14:12 ealmloff

@ealmloff just coming back to say: Kalosm is REALLY REALLY great! I just integrated it into a service flawlessly. I didn't hit a tokio runtime crash on tokio shutdown with reqwest (like I had with other infrastructure). I really think it should be a basis for candle itself for how to make it accessible. Also, I'd be happy if you could release a version of kalosm itself (I'm using it via git). Kudos!

jondot avatar Dec 12 '23 15:12 jondot

> Kalosm is REALLY REALLY great! I just integrated it into a service flawlessly.

Thanks! I'm glad it works well for you. Let me know if you run into any issues.

> Also, I'd be happy if you could release a version of kalosm itself (I'm using it via git). Kudos!

I'm working on adding some documentation here. After that is finished, I plan to release 0.1.0.

ealmloff avatar Dec 12 '23 19:12 ealmloff

Fantastic stuff! Thanks for the help and sorry for the trouble ❤️

jondot avatar Dec 13 '23 16:12 jondot

@jondot , perhaps you could check out candle-vllm?

EricLBuehler avatar Jan 20 '24 15:01 EricLBuehler

@EricLBuehler will do, I'm getting back to this topic now and trying to experiment with other models. @danielclough I'm wondering what the cost of cloning would be? Now that I want to try every model in candle (not just llama-family models), this seems like the best technique (other than reimplementing/patching the models that the candle team created).

jondot avatar Mar 31 '24 13:03 jondot

Meanwhile I did a test with mistral: cloning a freshly loaded model takes on the order of 1–1.5 ms, instead of ~100 µs. I believe that's considerable overhead for Rust (i.e. Rust doing some hard work cloning the tree).
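The measurement was along these lines (sketch; `Model` stands in for the loaded mistral weights):

```rust
use std::time::Instant;

#[derive(Clone)]
struct Model; // stand-in for the freshly loaded mistral weights

fn main() {
    let model = Model; // assume this was just loaded from disk

    let start = Instant::now();
    let _copy = model.clone();
    // Observed on the order of 1-1.5 ms right after loading.
    println!("clone took {:?}", start.elapsed());
}
```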

jondot avatar Mar 31 '24 16:03 jondot