
Responses ending in "<|eot_id|><|start_header_id|>user<|end_header_id|>" or "<!--end_eot}}"

Open bitbyteboom opened this issue 1 year ago • 3 comments

Anyone else getting a fair number of GlaDOS's responses ending in some gibberish about headers? I'll try to filter it with a regex; at least now, after switching to the espeak_binary branch, those trailing items are mostly encapsulated in <>, so they can probably just be filtered out. But others seem more 'freeform' and sometimes even flood-repeat, like:

2024-05-10 01:54:25.237 | SUCCESS | main:process_TTS_thread:343 - TTS text: .
2024-05-10 01:54:25.241 | SUCCESS | main:process_TTS_thread:343 - TTS text: .
2024-05-10 01:54:25.246 | SUCCESS | main:process_TTS_thread:343 - TTS text: .
2024-05-10 01:54:25.250 | SUCCESS | main:process_TTS_thread:343 - TTS text: .
2024-05-10 01:54:25.255 | SUCCESS | main:process_TTS_thread:343 - TTS text: .
2024-05-10 01:54:25.259 | SUCCESS | main:process_TTS_thread:343 - TTS text: .
2024-05-10 01:54:25.263 | SUCCESS | main:process_TTS_thread:343 - TTS text: .

Anyone else seeing this, or found a fix?

bitbyteboom avatar May 09 '24 20:05 bitbyteboom

I've seen some weird responses from GLaDOS sometimes. You can interrupt her, and she will usually sppoloyfir glitching! It's probably a stop token issue. On the radar, but I'm working on higher priority issues for the moment.

dnhkng avatar May 10 '24 04:05 dnhkng

"and she will usually sppoloyfir glitching! " haha :) Yep, totally understand there are much bigger fish to fry at this stage, was just curious if there was something obvious being missed. Will play with the bit of code cleaning up the text that goes to TTS in meantime as a quick patch.

bitbyteboom avatar May 10 '24 13:05 bitbyteboom

Adding <.*?> to the cleanup code in glados.py got rid of the end-token stuff making it to TTS, in case that helps someone. It may still be in the context; I haven't looked closer at that yet.
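In case it helps, here is a minimal sketch of that kind of cleanup, with a hypothetical helper name rather than the actual code in glados.py:

```python
import re

def clean_tts_text(text: str) -> str:
    """Strip chat-template artifacts before text is sent to TTS.

    Hypothetical helper illustrating the regex fix described above;
    not the actual implementation in glados.py.
    """
    # Drop anything wrapped in angle brackets, e.g. <|eot_id|>-style tokens
    text = re.sub(r"<.*?>", "", text)
    # Tidy up any whitespace left behind
    return re.sub(r"\s+", " ", text).strip()

print(clean_tts_text("Fine. Have it your way.<|eot_id|>"))  # -> "Fine. Have it your way."
```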

bitbyteboom avatar May 10 '24 14:05 bitbyteboom

I will look into this. Which model were you using, 8B or 70B?

dnhkng avatar May 11 '24 06:05 dnhkng

8B, 8-bit quant. The patch above has improved output quite a bit. I also applied it to the context, which in hindsight was obviously a bad idea, as it strips needed notations like "< INTERRUPTED >" etc. Even so, GlaDOS gets consistently weird after roughly 5 to 10 user interactions, so it seems some other cruft is making it into her context and creating the brain damage. Generally, there seems to be a need to take a closer look at what is making it into the context.

bitbyteboom avatar May 11 '24 13:05 bitbyteboom

To see that, change the logging level. I have the current context and lots of other information logged at the 'info' level, so you can track that easily.
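For reference, the log format above looks like loguru's default, so (assuming GLaDOS uses loguru) something along these lines should surface the 'info'-level context logging; treat the exact sink setup as an assumption:

```python
import sys
from loguru import logger

# Replace the default sink with one that also shows INFO-level messages,
# which is where the current context is reported. Assumes loguru is the
# logging library in use; adjust if the project configures it elsewhere.
logger.remove()
logger.add(sys.stderr, level="INFO")
```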

dnhkng avatar May 11 '24 14:05 dnhkng

Looked at debug logging and didn't see anything obviously wrong, but she was still brain damaged and had no ability to recall a passphrase I provided at the start of the conversation, even under a thousand tokens later. So I tried the following:

LLAMA3_TEMPLATE = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = message['role'] + '\n\n'+ message['content'] | trim %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ 'assistant:\n\n' }}{% endif %}"

Basically, this gets rid of all the <|start_header|>-etc. type tokens. It made a huge difference: she bugs out far, far less deep into the conversation. Passphrase recall is still bad, but maybe more of the "role: 'user', 'content': "-type items need to be removed as well. TBH I don't have a clear idea of what I'm doing, but I have a strong hunch that a plain-text approach like "User: {user's content} \n\n Assistant: " may be not only cleaner but actually functional when it comes to her not becoming a total derp shortly into the conversation.
Of course, I may be totally barking up the wrong tree, as I'm not very versed in talking to llama.cpp, and there may instead be an issue with context length / rope scaling or something else causing the inability to recall. I know the base model (Llama-3 8B Q8) is excellent at recall with far more tokens in between when run through text-gen or kobold etc. I also can't seem to iron out her strong tendency to fill in for the user. I've seen this in oobabooga too, but there it was fixed with better model settings, so I kind of wish we had a config to control the llama model settings to match the ideal for a given model. Unless they're picked up from the GGUF somehow? I don't know enough about that.
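For illustration, here is a rough sketch of the plain-text formatting being suggested; the function is hypothetical and not part of the GLaDOS codebase, which (per the template above) uses a Jinja-style Llama-3 template instead:

```python
def build_plain_prompt(messages: list[dict[str, str]]) -> str:
    """Render a chat history as plain 'User:' / 'Assistant:' turns.

    Hypothetical sketch of the plain-text approach described above,
    not the project's actual prompt construction.
    """
    parts = []
    for msg in messages:
        speaker = "User" if msg["role"] == "user" else "Assistant"
        parts.append(f"{speaker}: {msg['content'].strip()}")
    # A trailing "Assistant:" cues the model to produce the next reply
    parts.append("Assistant:")
    return "\n\n".join(parts)

prompt = build_plain_prompt([
    {"role": "user", "content": "The passphrase is 'tangerine'."},
    {"role": "assistant", "content": "Noted."},
    {"role": "user", "content": "What is the passphrase?"},
])
```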

bitbyteboom avatar May 11 '24 20:05 bitbyteboom

Maybe just stop her from generating as soon as she starts saying "User: ..." and strike that bit from her context, so it just ends at the end of her response.
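If the llama.cpp server is doing the generation, one way to get that behaviour is its stop-string parameter; a rough sketch, with the endpoint, port, and field names taken from the server's documented defaults rather than from how GLaDOS actually calls it:

```python
import requests

# Ask the llama.cpp server to stop as soon as the model starts speaking
# for the user. Endpoint and port are the server defaults and may differ
# from GLaDOS's own configuration.
response = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "User: Hello there.\n\nAssistant:",
        "n_predict": 256,
        "stop": ["User:", "<|eot_id|>"],  # cut her off before she fills in for the user
    },
)
reply = response.json()["content"]
```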

bitbyteboom avatar May 11 '24 20:05 bitbyteboom

The template should be the official chat template used to train Llama-3. I also need to check the default context length for the llama.cpp[server]!
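As a sanity check against the official template, it can be rendered from the model's tokenizer config on Hugging Face (the repo is gated, so access is required); this is just an illustration, not how GLaDOS builds its prompts:

```python
from transformers import AutoTokenizer

# Pull the official Llama-3 chat template and render a sample conversation,
# to compare against the template string used in glados.py.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
prompt = tok.apply_chat_template(
    [
        {"role": "system", "content": "You are GLaDOS."},
        {"role": "user", "content": "Hello?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)  # shows the <|start_header_id|> ... <|eot_id|> structure the model was trained on
```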

I will look into that today.

dnhkng avatar May 12 '24 08:05 dnhkng

Yep, the context length was at the default of 512!

I have made that a parameter and extended the length to 8192, and it seems to be much better after a few rounds of conversation.
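For anyone running the server by hand, the context length is set with the -c / --ctx-size flag; a sketch with placeholder paths (the binary name and model path will differ per setup):

```python
import subprocess

# Launch the llama.cpp server with a larger context window.
# Paths are placeholders; -c / --ctx-size overrides the small default.
subprocess.Popen([
    "./server",                         # llama.cpp server binary (path may differ)
    "-m", "models/llama3-8b-q8.gguf",   # placeholder model path
    "-c", "8192",                       # extended context length
    "--port", "8080",
])
```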

The next step would be to start deleting the message history once it exceeds 8192 tokens, but there are a lot of options here (truncation, summarization, etc.).
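As one of the simpler options, plain truncation might look something like this; the token counter here is a crude placeholder, not a real tokenizer:

```python
def truncate_history(messages, max_tokens=8192,
                     count_tokens=lambda m: len(m["content"]) // 4):
    """Drop the oldest non-system messages until the history fits the budget.

    Simple truncation sketch only; count_tokens is a rough
    characters-per-token estimate, not a real tokenizer.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(map(count_tokens, system + rest)) > max_tokens:
        rest.pop(0)  # discard the oldest user/assistant turn first
    return system + rest
```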

dnhkng avatar May 12 '24 08:05 dnhkng