serge
Use interactive mode of llama.cpp for better performance + ask multiple chats at once
Big PR.
Functionally:
- You can now keep conversations going without reloading the entire prompt.
- You can have multiple conversations generating answers at once.
- Each conversation has its own persistent thread, loaded as needed when it is prompted (a rough sketch follows the caveats below).
Caveats:
- Each thread has its own memory footprint, so opening too many threads will run you out of memory. The limit of 4 active threads is currently hardcoded; it will be made into a parameter later.
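
Roughly, the idea looks something like the sketch below: one persistent llama.cpp interactive process per conversation, started on demand and capped at a fixed number of active processes. The class name, binary path, and flags are illustrative assumptions, not the actual code in this PR.

```python
# Hypothetical sketch, not the PR's implementation: binary path, model path,
# and llama.cpp flags are assumptions and may differ by llama.cpp version.
import subprocess
from collections import OrderedDict

MAX_ACTIVE = 4          # hardcoded limit mentioned above
LLAMA_BIN = "./main"    # path to the llama.cpp binary (assumption)
MODEL = "ggml-model.bin"

class ConversationThreads:
    def __init__(self):
        self.active = OrderedDict()  # chat_id -> running interactive process

    def get(self, chat_id):
        if chat_id in self.active:
            self.active.move_to_end(chat_id)          # mark as most recently used
            return self.active[chat_id]
        if len(self.active) >= MAX_ACTIVE:
            _, oldest = self.active.popitem(last=False)
            oldest.terminate()                        # evict the least recently used thread
        proc = subprocess.Popen(
            [LLAMA_BIN, "-m", MODEL, "--interactive-first"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
        )
        self.active[chat_id] = proc
        return proc

    def ask(self, chat_id, prompt):
        # Follow-up prompts go to the live process's stdin, so the model keeps its
        # context and the full prompt never has to be reloaded.
        proc = self.get(chat_id)
        proc.stdin.write(prompt + "\n")
        proc.stdin.flush()
```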
This also clears a path for a future implementation with one web server serving the API and front end, plus a number of independent worker nodes that talk to the API server through Redis.
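
A minimal sketch of what that split could look like, where the API server pushes prompts onto a Redis list and each worker pops them and publishes answers back. The queue names, payload format, and helper names here are my assumptions, not part of this PR.

```python
# Hypothetical sketch of the future API-server + worker-node split over Redis.
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def enqueue_prompt(chat_id, prompt):
    """Called by the API server: hand the work off to whichever worker is free."""
    r.rpush("serge:prompts", json.dumps({"chat_id": chat_id, "prompt": prompt}))

def worker_loop():
    """Runs on each worker node: consume prompts, publish answers back."""
    while True:
        _, raw = r.blpop("serge:prompts")   # blocks until a prompt arrives
        job = json.loads(raw)
        answer = run_llama(job["chat_id"], job["prompt"])
        r.publish(f"serge:answers:{job['chat_id']}", answer)

def run_llama(chat_id, prompt):
    # Placeholder for the worker-local, per-conversation llama.cpp call
    # shown in the earlier sketch.
    return f"(answer for {chat_id})"
```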
Heya, I was testing this out locally, and while it works for the most part, there are times when it doesn't seem to detect that the response has ended. I.e. in the network console I see it still polling /stream, and if I try to add a new prompt, it removes the previous answer and then hangs.
I don't think I'll be working on this for now. Maybe in the future, but for the moment it's a dead end, so I'm closing this.