
Support Closing Sessions to Free Resources and Fix Issue #140

Open cdhermann opened this issue 1 year ago • 9 comments

  • Allows closing a specific session, thereby freeing the associated resources
  • Fixes https://github.com/tjake/Jlama/issues/140

cdhermann avatar Jan 05 '25 10:01 cdhermann

Thanks for this. I wonder if the KV cache should be marked ephemeral when the model is created? Otherwise you can never keep the cache around long term (say you want to store threads of different conversations to go back to)

tjake avatar Feb 22 '25 17:02 tjake
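To visualize that tradeoff, here is a purely hypothetical sketch (nothing like this flag exists in Jlama's current API): a KV cache marked ephemeral at load time would be freed on close, but its state could never be picked up again in a later run.

// Hypothetical only -- this overload does not exist in Jlama today.
// ephemeralKvCache = true: KV cache files are deleted when the model is closed,
// so a stored conversation thread could not be resumed from them later.
AbstractModel model = ModelSupport.loadModel(
        localModelPath, workingMemory, workingQuantization, /* ephemeralKvCache */ true);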

Thanks for this. I wonder if the KV cache should be marked ephemeral when the model is created? Otherwise you can never keep the cache around long term (say you want to store threads of different conversations to go back to)

Perhaps it's possible to achieve the best of both worlds: short-lived sessions that can be deleted and long-lived sessions that can be resumed by explicitly modeling the concept of a session.

E.g. something like this:

import java.util.UUID;

/**
 * Represents a session with a unique ID and persistence setting.
 */
public record Session(UUID sessionId, boolean persistent) {

    /**
     * Creates a persistent session with the provided session ID.
     * 
     * <p>
     * This session can be resumed even after the program exits.
     * </p>
     */
    public Session(UUID sessionId) {
        this(sessionId, true);
    }

    /**
     * Creates an ephemeral session with a new random session ID.
     * 
     * <p>
     * All resources are freed when the session is closed.
     * This session cannot be resumed later.
     * </p>
     */
    public Session() {
        this(UUID.randomUUID(), false);
    }
}

....

AbstractModel model = ModelSupport.loadModel(localModelPath, workingMemory, workingQuantization);

// Creates an ephemeral session
Session session = new Session();

Generator.Response response = model.generate(session, ctx, 0.1f, 1024, (s, f) -> {
    // Handle generation callback
});

/*
 * Closes the given session
 * - Persistent sessions: No deletion of the temporary files
 * - Ephemeral sessions: Deletes temporary files and marks them for deletion on exit
 */
model.close(session);

cdhermann avatar Feb 22 '25 20:02 cdhermann

Since Jlama provides a LangChain4j integration, the expectations of that integration's users should also be considered. Based on my understanding of the LangChain4j chat memory documentation, there is no default persistence. However, I must admit that I haven't explored the LangChain4j integration and its usage in depth yet.

cdhermann avatar Feb 22 '25 20:02 cdhermann
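For reference, a minimal sketch of how LangChain4j chat memory behaves as I understand it: the message window is backed by an in-memory store unless you plug in your own ChatMemoryStore, so there is no persistence by default. The class and method names come from LangChain4j's public API; the conversation id is made up for illustration.

import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.memory.ChatMemory;
import dev.langchain4j.memory.chat.MessageWindowChatMemory;
import dev.langchain4j.store.memory.chat.ChatMemoryStore;
import dev.langchain4j.store.memory.chat.InMemoryChatMemoryStore;

public class ChatMemoryExample {

    public static void main(String[] args) {
        // Default setup: the window lives only in process memory, so it is lost on exit.
        ChatMemory ephemeral = MessageWindowChatMemory.withMaxMessages(10);
        ephemeral.add(UserMessage.from("Hello"));

        // Persistence is opt-in: supply a ChatMemoryStore. InMemoryChatMemoryStore is the
        // default, non-persistent store; a database-backed implementation would be needed
        // for sessions that survive a restart.
        ChatMemoryStore store = new InMemoryChatMemoryStore();
        ChatMemory resumable = MessageWindowChatMemory.builder()
                .id("conversation-42")   // illustrative conversation id
                .maxMessages(10)
                .chatMemoryStore(store)
                .build();
        resumable.add(UserMessage.from("Continue where we left off"));
    }
}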

Based on my understanding of the LangChain4j chat memory documentation, there is no default persistence.

Correct, in this case it would always be ephemeral. But for Jlama I want to handle stored sessions. I can take a crack at fixing this based on your PR!

tjake avatar Feb 22 '25 21:02 tjake

@tjake, I think we can have the best of both worlds (ephemeral and persistent), giving users the power to choose whichever they prefer.

DumiJDev avatar Sep 13 '25 21:09 DumiJDev

Yes, agreed

tjake avatar Sep 13 '25 22:09 tjake

If you take vLLM, they have a shared KV cache. Users are encouraged to set a cache_salt if they want to ensure people can't "guess prompts" by looking at the timings of requests in multi-user environments. There is no concept of a user here; each generate call gets a new UUID. Rather than tying the cache to a single request, I think sharing the cache IS a good thing: if three people are working on the same problem they can share a pre-shared SHA salt with each other. Since it is a cache, expiring by time and volume makes the most sense to me. You simply don't want it to grow boundlessly big; everything else is a per-use-case optimization.

edwardcapriolo avatar Oct 22 '25 11:10 edwardcapriolo
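To make the cache_salt idea concrete, here is a rough sketch (not vLLM's or Jlama's actual implementation) of deriving a prefix-cache key from an optional salt plus the prompt prefix. Callers without the salt cannot produce matching keys, so they cannot probe other users' prompts via cache-hit timings.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Illustrative only: the cache key is a SHA-256 over the (optional) cache salt
// followed by the prompt prefix. With no salt, everyone shares keys and therefore
// cache hits; with a private salt, only callers who know it hit the same entries.
public final class PrefixCacheKey {

    private PrefixCacheKey() {}

    public static String of(String cacheSalt, String promptPrefix) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            if (cacheSalt != null && !cacheSalt.isEmpty()) {
                sha256.update(cacheSalt.getBytes(StandardCharsets.UTF_8));
            }
            sha256.update(promptPrefix.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(sha256.digest());
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should always be available", e);
        }
    }
}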

Take a look at this: "cacheSalt". The idea here is that in multi-user environments I could "guess" other users' prompts by looking at the timings of the responses.

https://github.com/edwardcapriolo/deliverance/pull/6

We all share the cache, which makes sense; if we don't want to, we use a "cache_salt" and the cache is private to those who know the salt (the SHA).

https://docs.vllm.ai/en/stable/design/prefix_caching.html

My next piece of work is a background thread to clean up old entries. We can expire by age or even by size.

edwardcapriolo avatar Oct 23 '25 23:10 edwardcapriolo
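A sketch of the kind of background cleanup described above, assuming the cache is just a map from some key (for example the salted prefix hash sketched earlier) to a value plus its last-access time; the entry record and janitor class are made up for illustration.

import java.time.Duration;
import java.time.Instant;
import java.util.Comparator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Illustrative only: a background "janitor" that bounds a cache both by entry age
// and by total number of entries.
final class KvCacheJanitor<K, V> {

    record Entry<T>(T value, Instant lastAccess) {}

    private final ConcurrentHashMap<K, Entry<V>> cache;
    private final Duration maxAge;
    private final int maxEntries;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "kv-cache-janitor");
                t.setDaemon(true);
                return t;
            });

    KvCacheJanitor(ConcurrentHashMap<K, Entry<V>> cache, Duration maxAge, int maxEntries) {
        this.cache = cache;
        this.maxAge = maxAge;
        this.maxEntries = maxEntries;
        // Sweep once a minute; the period and both limits are per-use-case tuning knobs.
        scheduler.scheduleAtFixedRate(this::sweep, 1, 1, TimeUnit.MINUTES);
    }

    private void sweep() {
        // Expire by age: drop anything not touched within maxAge.
        Instant cutoff = Instant.now().minus(maxAge);
        cache.entrySet().removeIf(e -> e.getValue().lastAccess().isBefore(cutoff));

        // Expire by volume: if still over the limit, drop the least recently used entries.
        int excess = cache.size() - maxEntries;
        if (excess > 0) {
            cache.entrySet().stream()
                    .sorted(Comparator.comparing(e -> e.getValue().lastAccess()))
                    .limit(excess)
                    .map(Map.Entry::getKey)
                    .forEach(cache::remove);
        }
    }
}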

A further improvement, a dedicated KV cache: https://github.com/edwardcapriolo/deliverance/pull/new/dedicated_kv

edwardcapriolo avatar Oct 25 '25 13:10 edwardcapriolo