General improvements
This is a placeholder / "intent to merge" PR for now, for awareness. Please have a look and see how much you want this broken up into separate PRs, or what you want to keep / remove, or if you're happy to merge it with minimal changes ;)
Mostly this is:
- various small fixes for improved robustness (error handling, error logging)
- better handling of LLM / TTS model bugs/hassles (e.g., hallucinated text that the speech recognition engine hears; and multiple LLM stop tokens)
- User configuration support, as a user_config.py file that overrides the built-in defaults (example file included)
- Configuration classes in the code that pass config to what needs it rather than using global variables (this also opens up possibilities for extended functionality in future)
- Abstraction of the Llama LLM so that it can run llama.cpp itself, or simply connect to a remote LLM (via a config option)
Most of the commits are small/atomic/clean, but the last one is still WIP.
FYI, I intend to do more work in my branch and PR it: abstracting out the backend speech-to-text and text-to-speech engines, and separating the client (microphone/speaker) and frontend server (GLaDOS logic / API) parts.
I'm very happy to incorporate the smaller changes. Thanks for the help!
Just a few small points: On the hallucinations/VAD sensitivity - I've never had these issues. I saw someone else with the opposite, where the last word was being clipped. I'm starting to think these are hardware issues. I'm not sure we want to write software fixes for all the different ways microphones might cause problems.
The second is the external LLM support. I'm in two minds about this. Yes, it would allow more people to use the system, but it also breaks the key reason for the project: a 'local GLaDOS'.
Any chance you could break down your changes into 'bug fixes and small improvements', and 'new features'? That would make a code review possible.
> The second is the external LLM support. I'm in two minds about this. Yes, it would allow more people to use the system, but it also breaks the key reason for the project: a 'local GLaDOS'.
Actually, setting an external server URL in `LLAMA_SERVER_URL` and adjusting `LLAMA_SERVER_HEADERS` with the correct Bearer headers already allows me to use an external llama server. Is there anything I'm missing that makes a change here necessary?
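For reference, something like this in user_config.py (the URL and token are placeholders for my setup, not real values):

```python
# user_config.py -- point GLaDOS at an already-running external llama server.
# The URL and token below are placeholders; adjust them to your deployment.
LLAMA_SERVER_URL = "http://192.168.1.50:8080"
LLAMA_SERVER_HEADERS = {"Authorization": "Bearer <your-api-token>"}
```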
Yes, I know, it's a small change. What I'm thinking about more is removing the llama.cpp server code completely (llama.py), and letting users use a third-party API (Groq or OpenAI) or fire up a local Ollama server with `LLAMA_SERVER_URL` etc. That would make the repo smaller and more flexible, but I'm not familiar with Ollama.
On the one hand, it seems like a simpler solution, as there were a lot of requests on Reddit asking for assistance compiling llama.cpp. On the other hand, it's another layer of indirection, and ollama barely mentions the llama.cpp backend, and I don't think that's cool...
That totally makes sense. Thank you!
Personally, I would appreciate the ability to hook this front-end into a separate backend (I run models locally with oobabooga/text-generation-webui). As a plus, it would allow me to fine-tune a model on GLaDOS' game dialogue to better emulate her character. I'm fairly certain most backends use an OpenAI-style API though, which may not be as quick as the current implementation.
ooba is also a potential approach, but that would mean deciding which endpoint to use in the configuration. With llama.cpp[server], I noticed the Llama-3 chat template didn't work correctly, so I implemented a message dictionary processor. I'm not familiar with how ooba likes to have its messages delivered. Is it the completions or chat/completions endpoint, or are both available?
Lastly, if you wanted to use a character prompt in ooba, should we then have a flag to ignore the current prompt?
Personally, I would rather have you help design an amazing prompt in this repo, and potentially fine tune a custom model for us! Ooba is quite a general platform, I see this project as highly tailored to build a great GLaDOS. Of course, we could consider a bunch of 'personality cores' in the future... Dump in voice data and backstory text, and it auto-generates a full voice and character...
If I remember correctly, ollama has one of the easiest learning curves, and uses very simple POST requests for chatting. It can be set up on practically all OSes, which makes it pretty easy. Plus, they have good documentation.
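For example, a chat request to ollama is just a JSON POST to its `/api/chat` endpoint (host and port are ollama's defaults; the model name is just an example):

```python
import json
from urllib import request

# Build the JSON POST that ollama's /api/chat endpoint expects.
# Host/port are ollama's defaults; the model name is an example.
payload = {
    "model": "llama3",
    "messages": [{"role": "user", "content": "Hello, GLaDOS."}],
    "stream": False,  # request a single JSON reply rather than a stream
}
req = request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# response = json.load(request.urlopen(req))  # needs a running ollama server
```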
> The second is the external LLM support. I'm in two minds about this. Yes, it would allow more people to use the system, but it also breaks the key reason for the project: a 'local GLaDOS'.
> Actually, setting an external server URL in `LLAMA_SERVER_URL` and adjusting `LLAMA_SERVER_HEADERS` with the correct Bearer headers already allows me to use an external llama server. Is there anything I'm missing that makes a change here necessary?
The main reason is this part of the diff:
```diff
- if not self.llama.is_running():
-     self.llama.start(use_gpu=True)
+ if not LLAMA_SERVER_EXTERNAL:
+     if not self.llama.is_running():
+         self.llama.start(use_gpu=True)
```
The original code seems to potentially launch a large language model even if that's not the user's intent -- even if there's only a temporary connection error to the already running LLM, for example. I would very much like to avoid that, because I run a single centralised LLM and share it between multiple clients (open-webui, aider, now GLaDOS, etc.), and my goal is to run something like a GLaDOS client in each room, plus mobile, all connecting to a central server (over VPN where needed).
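Roughly, the guarded version behaves like this (a sketch mirroring the diff; the surrounding class and method names are hypothetical, not the PR's actual code):

```python
class LlamaBackend:
    """Hypothetical sketch of the external-vs-local guard from the diff."""

    def __init__(self, external: bool, llama=None):
        self.external = external  # e.g. the LLAMA_SERVER_EXTERNAL config flag
        self.llama = llama        # local llama.cpp process wrapper, if any
        self.started = False

    def ensure_running(self):
        # Never auto-launch a local model when the user points at an
        # external server -- a transient connection error should surface
        # as an error, not silently spawn a second LLM.
        if self.external:
            return
        if self.llama is not None and not self.llama.is_running():
            self.llama.start(use_gpu=True)
            self.started = True
```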
> Any chance you could break down your changes into 'bug fixes and small improvements', and 'new features'? That would make a code review possible.
Absolutely, will work on it this weekend. No worries if you don't want to merge any parts, I'll maintain those in my own branch, and rebase onto the upstream merged stuff.
> If I remember correctly, ollama has one of the easiest learning curves, and uses very simple POST requests for chatting. It can be set up on practically all OSes, which makes it pretty easy. Plus, they have good documentation.
https://www.reddit.com/r/LocalLLaMA/s/EhY7SkJMoi
This was an interesting discussion about Ollama. It looks like it's really not needed.
Yeah, ollama is just a wrapper around llama.cpp. It's handy and works well for some people, but it's getting too much credit for a wrapper.
To me, the key is that there are lots of ways to use a third-party OpenAI-compatible API, or to run one locally: llama.cpp, kobold.cpp, ollama, KoboldAI, LocalAI (quite complete, including TTS and STT), OpenedAI-speech (just the TTS/STT parts), vllm, etc. So using that API externally, or providing it simply and easily for anyone who doesn't already have a suitable API set up, is probably the best route.
The only issue that needs to be addressed is in how chat history is processed.
Until recently, Llama-3 chat prompting was broken, and llama.cpp[server] does not handle most chat formats. Even now, although in theory it can process the Llama-3 chat template, I find it's a bit broken (after a few rounds of chatting, it breaks).
That's why, instead of using the "/v1/chat/completions" endpoint, I convert the system prompt and chat history to a properly chat-formatted string and use the basic "/completion" endpoint. This was also the recommendation of the llama.cpp chat-formatting author when I was working on a different llama.cpp project.
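A sketch of that conversion, using the header/eot tokens from Meta's published Llama-3 chat template (the function name and structure are illustrative, not the PR's actual code):

```python
def format_llama3_prompt(system: str, history: list[dict]) -> str:
    """Render a system prompt plus chat history into a single Llama-3
    formatted string, suitable for llama.cpp's raw /completion endpoint.
    Special tokens follow Meta's published Llama-3 chat template."""
    parts = [
        "<|begin_of_text|><|start_header_id|>system<|end_header_id|>"
        f"\n\n{system}<|eot_id|>"
    ]
    for msg in history:  # each msg: {"role": "user"|"assistant", "content": ...}
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>"
            f"\n\n{msg['content']}<|eot_id|>"
        )
    # A trailing assistant header cues the model to generate the next reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)
```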
So, we have 1) the choice of backend, and 2) whether or not we pre-format the system prompts and chat history.
Lastly, during the installation process, we should have the option of a basic default configuration, e.g. we have llama.cpp as a sub-repo, compile it with the right flags, and download a 'decent' model based on the user's GPU. Maybe Phi-3 for Intel CPU only, Llama-3 8B for Macs and regular gaming GPUs, and Llama-3 70B for people who have invested wayyyyy too much in LLM hardware...
The user could also choose not to install a backend, if they wanted to use a commercial API. Maybe we could at this point store tokens as environment variables, so nothing is accidentally saved to the repo?
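For example (the environment variable name is hypothetical; a real setup would pick one convention):

```python
import os

def auth_headers(env_var: str = "GLADOS_LLM_API_KEY") -> dict:
    """Build Bearer auth headers from an environment variable (name is
    hypothetical), so the token never gets committed to the repo."""
    token = os.environ.get(env_var, "")
    return {"Authorization": f"Bearer {token}"} if token else {}
```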
> To me, the key is that there are lots of ways to use a third-party OpenAI-compatible API, or to run one locally: llama.cpp, kobold.cpp, ollama, KoboldAI, LocalAI (quite complete, including TTS and STT), OpenedAI-speech (just the TTS/STT parts), vllm, etc. So using that API externally, or providing it simply and easily for anyone who doesn't already have a suitable API set up, is probably the best route.
I think that's OK for the LLM (as few people can run 70B or even Nous Capybara), but I think that running Whisper and TTS locally would be a fair requirement. I understand there are already ASR and TTS plugins for SillyTavern etc., and I don't want this to be just an alternative to those projects.
This project should really be focused on making GLaDOS incredibly clever and interactive (real-time vision, the ability to interrupt the user, etc.), not necessarily on making it as widely adopted as possible just yet. That will come naturally as Microsoft 'AI PC'-ready hardware with 40+ TOPS becomes standard.
I just pushed some big changes that overlap with some of the changes here, such as:
- a YAML config file
- separation of GLaDOS from the LLM
- support for any LLM API
- better organisation with optional submodules
Still looking forward to your PR, or smaller PRs for specific fixes/functionality.
I will close this PR, as I understand that most of the features are now in main. I would welcome small PRs that cover the other features though!