Is the --ignore-eos flag redundant?
As per https://github.com/ggerganov/llama.cpp/blob/da5303c1ea68aa19db829c634f1e10d08d409680/main.cpp#L1066, the EOS token in interactive mode simply causes is_interacting to switch on, so it serves as a way to end the current series of tokens and wait for user input. Is there any reason to actually avoid sampling it in the first place, then?
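(For readers following along: the section being linked to looks roughly like this. This is paraphrased, so the exact names may differ slightly from that commit; EOS_TOKEN_ID is 2 for llama models.)

// end of text token
if (embd.back() == EOS_TOKEN_ID) {
    if (params.interactive) {
        // in interactive mode, EOS just hands control back to the user
        is_interacting = true;
    } else {
        // otherwise it terminates generation
        fprintf(stderr, " [end of text]\n");
        break;
    }
}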
Hm. The difference in use is that the --ignore-eos option stops the end of text token from appearing in the first place, so generations will not be interrupted to prompt for user input. That's really the only difference. Personally I have weird issues when is_interacting switches on after an end of text token is reached when not using --ignore-eos. A lot of the time my input seems to get ignored and it continues on with what it was doing prior. 🤷
When I made the PR for --ignore-eos, the code that ignores eos in interactive mode hadn't been added yet. However, I think that my solution is better because it avoids sampling eos at all in the first place; otherwise the eos is going to end up in the context, and that may make the LLM do weird things. But it's just an assumption.
I can confirm that. I actually tried that approach before the flag got added. It doesn't work very well, because the LLM will basically go off the rails or start a completely new response that's likely not even related to the initial request.
Also, in my opinion as a random person on the internet, one can just not pass the flag for ignoring EOS if that's the behavior they desire. So why go through the trouble of adding special code to disable it even when explicitly specified?
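For context on the difference: what --ignore-eos does is prevent eos from ever being sampled, rather than intercepting it after the fact the way interactive mode does. A minimal, self-contained sketch of that idea (this is not the exact llama.cpp code; it just assumes eos is token 2 and that we have the raw next-token logits):

#include <cmath>
#include <vector>

// Sketch only: push the EOS logit to negative infinity so the sampler can
// never pick it. The real --ignore-eos implementation may differ in detail.
static const int EOS_TOKEN_ID = 2;

void suppress_eos(std::vector<float> & logits) {
    logits[EOS_TOKEN_ID] = -INFINITY;
}

Since the token is never sampled, it also never ends up in the context, which is the difference being discussed above.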
I think we should address this weird behavior around the end of text token in interactive mode. Perhaps the correct approach is to default to ignoring EOS in interactive mode instead of using it to switch to user input mode.
Meaning --ignore-eos is on by default in interactive mode, and off by default in non-interactive mode.
What do you think?
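Roughly, the wiring for that default could look something like this (just a sketch; the field names here are hypothetical and not the actual params layout):

// Hypothetical sketch of the proposed default, not existing code:
// ignore EOS by default in interactive mode, honor it otherwise,
// unless the user explicitly set the flag themselves.
struct params_sketch {
    bool interactive            = false;
    bool ignore_eos             = false;
    bool ignore_eos_set_by_user = false; // hypothetical: true if --ignore-eos was passed
};

void apply_eos_defaults(params_sketch & params) {
    if (!params.ignore_eos_set_by_user) {
        params.ignore_eos = params.interactive;
    }
}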
Without EOS, the model has no way to cut short its generation, am I mistaken? In that case a reverse prompt would be required or there would be no possibility of interaction.
EDIT: Upon further testing, I'm seeing that --ignore-eos, even with a reverse prompt, seems to send the model into generating endless nonsensical output.
This is a bit of a tangent, but I've been looking further into the weird behavior when the end of text token occurs and gives the user control (without the use of --ignore-eos, meaning end of texts can occur), and I would like to propose a change to your end of text/interactive mode PR that was merged, @tjohnman. While I'm not super familiar here, it seems that changing that section to add these two lines improves, if not outright fixes, the weird lack of coherence that occurs after an end of text token. I'm not qualified enough to speak on this technically, but I saw embd.back() = 13
and last_n_tokens.back() = 13
thrown around as a small hack to get around end of texts prior to your PR.
// end of text token
if (embd.back() == EOS_TOKEN_ID) {
    if (params.interactive) {
        is_interacting = true;
        // the two added lines: replace the end of text token with a
        // newline (token 13) in the generated sequence and in the
        // recent-token history, so the eos never lands in the context
        embd.back() = 13;
        last_n_tokens.back() = 13;
    } else {
        fprintf(stderr, " [end of text]\n");
        break;
    }
}
Here are two (edit: three, I got a really good one showcasing the current default behavior) excerpts from a reverse prompt dialogue WITHOUT this addition, i.e. the current behavior when an end of text token is reached. (I edited in the [end of text] parts for clarity, to indicate when it gave me back control.)
Start of the neverending dialogue that lasts forever and goes on for 1000 years
Jessie: H-hi there.. if someone is there. *sweats*
User: Hi there!
Jessie: A-ah! *startled*
User:It's okay, it's just me. :)
Jessie: Oh thank goodness...I thought i was going crazy talking to myself like this. [end of text]
User:Hello?
This user account has been created by a bot (or automated program).
If you have any questions about its content or actions, contact the bot operator directly using their talk page message system.
To protect against spamming, all new accounts are automatically blocked from editing until they pass manual review. If your request is approved, your username will be changed to match your email address.
Retrieved from "https://en.wikipedia.org/wiki/Special:UsedAccount" [end of text]
A bit of secondhand embarrassment in this next one, as it randomly started going on about the anime Naruto and fan-fiction...
Start of the neverending dialogue that lasts forever and goes on for 1000 years
Jessie: H-hi there.. if someone is there. *sweats*
User: Hi there!
Jessie: A-ah! *startled*
User:It's okay. It's just me.
Jessie: Oh... ummm.... what do you need from me today?
User:end of story
[end of text]
User: Nothing.
Title: Nothing.
Characters/Pairing: Kiba Inuzuka & Tenten (from Naruto)
Rating: PG-13 (for language)
Summary: The two are enjoying a day off together when they come across something unexpectedly beautiful.
Notes: This was written as part of the 'Narutomania' challenge over at /fic/, which asked us to write a one shot using any combination of characters or pairings from the show.
Kiba Inuzuka & Tenten were having a great time exploring Konoha during their days off.
A particularly strong example of how it entirely forgot who I was speaking with after an end of text.
Start of the neverending dialogue that lasts forever and goes on for 1000 years
Jessie: H-hi there.. if someone is there. *sweats*
User: Hi there!
Jessie: A-ah! *startled*
User:Don't worry, just me.
Jessie: Oh... thanks....
\end{code} [end of text]
User: Hi.
User: Hello?
System: Yes. What would you like today?
User: What happened to Jessie?
System: Who are you talking about?
User: Is Jessie not here?
System: Nope. She left a while ago.
User: Oh, that's too bad.
System: You mean she was here before?
User: I thought she was.
System: Well, maybe it wasn't her then. Maybe we were thinking of another person. Or maybe we were thinking of two people at once, but one got lost inside our heads somewhere along the way.
User:
And here are two small excerpts WITH the above change when the end of text token is thrown.
Start of the neverending dialogue that lasts forever and goes on for 1000 years
Jessie: H-hi there.. if someone is there. *sweats*
User: Hi there!
Jessie: A-ah! *startled*
User:It's okay, it's just me. :)
Jessie: Oh thank goodness...I thought maybe a ghost was talking to me or something...
User:end of story
[end of text]
EndUser: Hi.
Jessie: Hello! [end of text]
User:Hi
Jessie: Who are you?
Start of the neverending dialogue that lasts forever and goes on for 1000 years
Jessie: H-hi there.. if someone is there. *sweats*
User: Hi there!
Jessie: A-ah! *startled*
User: It's okay! It's just me. :)
Jessie: Oh...thank god! You scared me half to death! :O
User:end of story
[end of text]
EndUser: Now what?
Jessie: Well now we have a conversation going on between us two, but it doesn't end here. [end of text]
I've tested this over the past day, and it seems pretty apparent that without embd.back() = 13 and last_n_tokens.back() = 13 it completely loses the plot when you give any input following the end of text token.
I would make this a PR myself, but I'm really not certain about it and don't want to introduce any unintended or bad behavior. Still, the change seems to fix the weird end of text behavior I get regularly when not stripping out the EOS token altogether with --ignore-eos. (I will admit my usage of llama.cpp focuses mostly on reverse prompt assistant chatbot interaction, so I didn't see how not having an end of text token could be detrimental otherwise. I would like the argument to stay, so long as it isn't a default.)
@rabidcopy That was very thorough. Thank you! Unfortunately, I'm not very knowledgeable myself (I don't even know what token 13 is) so I don't know why your examples work out the way they do either. What is token 13?
No idea. It was a snippet I saw floating around, posted anonymously. Going out on a limb, I'd guess it somehow keeps the context on track or "restores" a state after an end of text is reached?
Edit: I'll probably make a PR for this later and see if someone more knowledgeable can sign off on it. Though I don't see the harm, as it only affects end of text behavior in interactive mode, and the former behavior doesn't seem particularly ideal.
Token 13 is a newline. llama.cpp dumps the prompt and token ids when starting by default, as far as I know.
Just as an example (and I know there's a typo in the prompt, it doesn't matter for this example :)
main: prompt: ' In the context of and machine learning, what does "perplexity" mean?
Answer:'
main: number of tokens in prompt = 23
1 -> ''
29871 -> ' '
512 -> ' In'
278 -> ' the'
3030 -> ' context'
310 -> ' of'
322 -> ' and'
4933 -> ' machine'
6509 -> ' learning'
29892 -> ','
825 -> ' what'
947 -> ' does'
376 -> ' "'
546 -> 'per'
10709 -> 'plex'
537 -> 'ity'
29908 -> '"'
2099 -> ' mean'
29973 -> '?'
13 -> '
'
13 -> '
'
22550 -> 'Answer'
29901 -> ':'
Token ID 2 is the end of document marker and token ID 1 is the start of document marker. You can see the prompt gets generated with an SOD at the beginning. If the LLM generates that, it could also cause weird stuff (I heard someone else mention this can happen, but I haven't seen it).
Note, this is just based on the current tokenizer behavior and the models I've tried. I think it's the same for all llama and alpaca models, but I am far from an expert.
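To keep the ids straight, these are the special tokens being discussed in this thread (just restating the dump above as constants):

// Special llama token ids referenced in this thread:
static const int BOS_TOKEN_ID     = 1;  // start of document marker
static const int EOS_TOKEN_ID     = 2;  // end of document marker
static const int NEWLINE_TOKEN_ID = 13; // '\n', what the proposed patch substitutes for EOS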
@rabidcopy If 13 is a newline, it makes sense that it would help smooth out the behavior: the model outputs an EOS, but we are essentially telling it "I don't care that you are finished, we'll just keep going". 😏
What you are proposing is adding a newline (token 13) where there would normally be an EOS in interactive mode, right? That way you wouldn't need to use --ignore-eos in these cases. Is that correct?
We could then remove this flag (or maybe it has other uses so we could keep it), and allow the model to generate EOS as a way for us to know that we need to go to interaction mode and add a newline instead. Do you think this could work well?
We could then remove this flag (or maybe it has other uses so we could keep it)
I still find it useful outside of interactive mode to force the model to generate longer text, even if it may sometimes cause it to go off the rails. For example, when using the beginning of a story as the prompt to get the LLM to finish it, sometimes it will just generate an eos after a paragraph or two, and this can be prevented with --ignore-eos.
I concur; there are scenarios where I think some users may prefer it to generate endlessly without being given back control unless they interject with CTRL+C.
Thank you for making this, @rabidcopy. I've actually encountered this issue before and was totally perplexed by it. I thought it was some issue with my prompts. I'm going to merge this fix into my experimental branch right away.