alpaca.cpp
                        The model can't seem to keep track of a conversation.
The program doesn't seem to "remember" what was said previously, so it's difficult to maintain conversational flow. This example was generated with the 13B model, but the same happens with the 7B one as well.
(Running on Windows 11 with WSL2 Ubuntu, weights downloaded from the provided magnet links)

If you're willing to manually retype the conversation history, then you can get your question answered, like so:
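Roughly, that means restating the earlier exchange yourself at the start of the next prompt. The exact wording below is just an illustration, not a required format:

```
Earlier I asked: "What is the capital of France?" and you answered: "Paris."
Given that, roughly how many people live in that city?
```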

Thanks! I guess that'll do for now. Hoping this gets integrated into the program itself... I don't think the original llama.cpp repo has this issue.
After playing around with it some more, I'm somewhat more confused, but I no longer think the model lacks 'conversational memory'.
Also, the chat.cpp file is identical in this repo and the one it was forked from, which suggests the chat logic is the same.

Yet even if it can sometimes 'remember' previous conversation, it does so only very intermittently, so imo your original report is basically correct: there is a lot of engineering work we can do here to improve the model's conversational memory.
I am working on a version that more explicitly conveys the idea to Llama that there is a single-threaded conversation and its job is only to respond to the user. Curious whether anybody else has made any kind of significant progress with this.
I have also seen a few cases of indisputable conversational memory across 2 or 3 separate questions, but it's been very rare. No time to work on this myself, unfortunately, but I look forward to seeing what folks come up with to make it a properly conversational tool.
I guess the biggest problem will be that the "emulated" conversational memory, i.e. adding your whole previous conversation (or just a summary of it) as part of the prompt, will quickly hit the limit on the number of tokens this model can take as input.
This video explains it quite nicely: https://www.youtube.com/watch?v=VW5LBavIfY4&feature=youtu.be
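As a rough sketch of the kind of bookkeeping I mean (purely illustrative, not how chat.cpp is actually structured, and count_tokens() here is just a crude word-count stand-in for the real tokenizer): keep the past turns in a list and drop the oldest ones until the re-fed history fits under the context limit.

```cpp
#include <deque>
#include <iterator>
#include <sstream>
#include <string>

// Crude stand-in for the real tokenizer: counts whitespace-separated words.
// The actual limit is measured in model tokens, not words.
size_t count_tokens(const std::string &text) {
    std::istringstream ss(text);
    return static_cast<size_t>(
        std::distance(std::istream_iterator<std::string>(ss),
                      std::istream_iterator<std::string>()));
}

// Build the prompt from past turns, dropping the oldest ones so the
// "emulated" memory stays under the model's context limit.
std::string build_context(std::deque<std::string> &turns,
                          const std::string &new_input,
                          size_t max_tokens) {
    turns.push_back(new_input);

    auto total = [&turns]() {
        size_t n = 0;
        for (const auto &t : turns) n += count_tokens(t);
        return n;
    };
    while (turns.size() > 1 && total() > max_tokens)
        turns.pop_front();  // forget the oldest turn first

    std::string prompt;
    for (const auto &t : turns)
        prompt += t + "\n";
    return prompt;
}
```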
https://github.com/deep-diver/Alpaca-LoRA-Serve
It implements a functional context system and has a demo running on a cloud instance that shows promise. My local testing shows that alpaca.cpp looks like it doesn't remember history, which makes me confused about the -c and --ctx_size params for alpaca.cpp, because they clearly don't work. Their (LoRA-Serve) implementation is targeted towards GPUs with the VRAM capacity to run these models, unlike the CPU-based alpaca.cpp. Seeing it refactored for CPU applications would be nice.
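For what it's worth, the loop I'd expect a CPU-side context system to boil down to is something like the sketch below. This is purely illustrative and not how alpaca.cpp currently works; run_model() is a hypothetical placeholder for the actual inference call. Each turn, an instruction preamble plus the accumulated history is re-fed as the prompt, and the model's reply is appended back onto the history.

```cpp
#include <iostream>
#include <string>

// Placeholder for actual inference; a real version would feed the prompt
// to the model and return the generated text.
std::string run_model(const std::string &prompt) {
    return "(model output for a prompt of " +
           std::to_string(prompt.size()) + " characters)";
}

int main() {
    const std::string preamble =
        "Below is a single ongoing conversation. Respond only to the user's "
        "latest message, taking the earlier exchange into account.\n\n";
    std::string history;

    std::string user_input;
    while (std::getline(std::cin, user_input)) {
        history += "User: " + user_input + "\n";
        std::string reply = run_model(preamble + history + "Assistant:");
        history += "Assistant: " + reply + "\n";
        std::cout << reply << "\n";
    }
    return 0;
}
```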