Unexpected behavior from `select`
The bug
This isn't technically a bug, but it isn't really a feature either. I'd call it unexpected behavior due to an implementation choice. When `select` is called, it greedily decodes using tokens that are consistent with the set of choices. A choice can look consistent with the request in its initial tokens but then go very wrong later in the decoding.
In the example below, I ask the LLM to produce a word "very similar to sky". This has a very high probability of just producing the word "sky" again, but it cannot, because my `select` only allows the words "skill" and "cloud".
A human, given those two options, would always pick "cloud", because clouds are in the sky. However, with the current implementation of `select`, the LLM will always produce "skill", because the first token "sk" of "skill" matches the first token "sk" of "sky". The model greedily produces "sk", which it then must complete as "skill", and the far better choice, "cloud", is never generated.
Proposed solution
Something far more consistent with human predictions would be to generate the tokens for each option while accumulating the probability, and then select based on the total probability at the end of generation. Even better would likely be to normalize each option by the probability that the LLM would produce it unprompted, and select the option with the largest shift in distribution given the context (sort of a Bayesian thing), but that would likely be computationally prohibitive.
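To make that concrete with made-up numbers and token splits (purely illustrative):
import math

# Purely made-up per-token probabilities for the example above.
# Greedy: at the first constrained step "sk" (0.40) beats "cl" (0.35), so the
# model is locked into "skill" even though the whole word is less likely.
p_skill = 0.40 * 0.10   # P("sk" | prompt) * P("ill" | prompt + "sk")
p_cloud = 0.35 * 0.95   # P("cl" | prompt) * P("oud" | prompt + "cl")

# What I'm proposing: accumulate (log-)probability over each full option,
# then choose the winner at the end.
scores = {"skill": math.log(p_skill), "cloud": math.log(p_cloud)}
print(max(scores, key=scores.get))   # -> "cloud"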
I would view this as an API-breaking change, because it would likely change the responses of `select` drastically. I'm not sure what the project's current stance is on such changes, but it could be handled by adding something like a `select_mode` argument that defaults to "greedy" and accepts "lazy" (or some similar name) for this alternate mode.
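For instance (hypothetical, nothing like this exists today):
# Hypothetical select_mode argument -- "greedy" would be today's behavior,
# "lazy" would score each full option before committing.
select(["cloud", "skill"], select_mode="lazy")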
I'm happy to contribute the PR, but I'm not familiar with the codebase yet and it isn't clear to me where I'd need to make the changes.
To Reproduce
from guidance import models, select

# This is resilient to model choice; I observe it across all local LLMs I test,
# in either direct prompting or using the chat interface and an instruction prompt.
lm = models.LlamaCpp("solar-10.7b-instruct-v1.0.Q6_K.gguf", n_gpu_layers=-1, n_ctx=4096)

lm + 'A word very similar to "sky" is "' + select(["cloud", "skill"])  # expect "cloud", get "skill"
System info (please complete the following information):
- OS (e.g. Ubuntu, Windows 11, Mac OS, etc.): Mac OS
- Guidance Version (guidance.__version__): 0.1.10
Generation is not even needed when comparing several options. One can do a single forward pass for each option, take all next-token logprobs, and combine them into the probability of the whole sequence. Then, compare the probabilities of each sequence and return the most likely.
Is there any reason why select is not implemented this way? It seems to be both the most efficient approach and "the most correct" from a mathematical perspective.
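For illustration, outside of guidance this whole-sequence scoring is just one teacher-forced forward pass per option, something like the sketch below ("gpt2" is only a small stand-in model, and it assumes the option's tokens start exactly at the prompt's token boundary):
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sequence_logprob(prompt: str, option: str) -> float:
    """Score `option` as a continuation of `prompt` with a single forward pass."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits          # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    for i in range(prompt_len, full_ids.shape[1]):
        # log P(token_i | tokens_<i): logits at position i-1 predict token i
        total += log_probs[0, i - 1, full_ids[0, i]].item()
    return total

prompt = 'A word very similar to "sky" is "'
options = ["cloud", "skill"]
scores = {o: sequence_logprob(prompt, o) for o in options}
print(max(scores, key=scores.get))               # most likely whole option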
Hey! Great questions here. The current behavior was chosen because it is most consistent with a direct biasing of the token probabilities. In other words it exactly filters the next tokens according to what is valid. This means that the select works the same way as a regex (or any other function we come up with) that specifies the same behavior.
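Concretely, these two should constrain decoding in the same way (a rough sketch, reusing the model from the repro above):
from guidance import models, gen, select

lm = models.LlamaCpp("solar-10.7b-instruct-v1.0.Q6_K.gguf", n_gpu_layers=-1, n_ctx=4096)
prompt = 'A word very similar to "sky" is "'

# Both filter the next tokens so the output stays consistent with an allowed completion.
lm + prompt + select(["cloud", "skill"])
lm + prompt + gen(regex="cloud|skill")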
As you point out @Koloth there can be problems when the model starts with a token because it is planning to use a followup token, but then that followup token is blocked. This is what happens in your example above. The model wants to write "skies" using the two tokens "sk" and "ies" but gets stuck with "skill" since it can't use "ies". You could attempt to address this using look ahead, essentially doing what @prompteus suggests and computing each of your options in full and then comparing their log_probs. The problem with that approach is that things are often not very well normalized, so it comes with its own problems (I could make the model avoid an option just by adding filler words to it).
In my opinion the best solution is to help the model anticipate the choices it is about to make, that way it will not get stuck going down the wrong path and is more likely to make good reasoning choices. For example in this trivial scenario:
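(sketching the idea with the repro example; the exact prompt wording is just illustrative)
from guidance import models, select

lm = models.LlamaCpp("solar-10.7b-instruct-v1.0.Q6_K.gguf", n_gpu_layers=-1, n_ctx=4096)

# Naming the options in the prompt lets the model "see" the choice before it
# has to commit to a first token, so it is far less likely to start down the
# "sk..." path just because "sky" was mentioned.
lm += 'Of the two words "cloud" and "skill", the one most similar to "sky" is "'
lm += select(["cloud", "skill"])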
So if you start running into these problems, I think your best bet is to steer the model with more prompt guidance, rather than spending compute on forward look-ahead.
Now...all that said, I think we should of course support forward look ahead style options. The thing holding us back right now is that guidance does not compute the logit values for tokens that are entirely forced. We can fix that, but it will require a bit of reconfiguring of how the parser records logits so that we can still efficiently send batches to the GPU/TPU. I am adding an "enhancement" label to this issue to mark that we need to support that. Once we support this we can do exactly what @prompteus suggested in a for loop like:
from guidance import capture

lm += prompt
log_probs = []
for option in options:
    tmp = lm + capture(option, name="val")
    log_probs.append(tmp.log_prob("val"))
Sounds perfect! I look forward to seeing it implemented.
What @slundberg suggests seems to me like a good workaround. Still, I am happy to hear that there are plans to support log_prob calculation as well.
@slundberg it may get expensive, but I was thinking that once the parser fully handles logits, it could be nice to add some optional beam search to gen and/or select (or even at the model level if someone wants to do this somewhat globally). We probably don't want to naively expand select calls out to every possible outcome, especially if there's excessive branching in the grammar. Any thoughts? :)
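To make the proposal concrete, the kind of knob I'm imagining (entirely hypothetical argument names, nothing like this exists today):
# Hypothetical beam_width arguments, just to illustrate the idea -- neither
# exists in guidance today.
lm + prompt + select(["cloud", "skill"], beam_width=4)
lm + prompt + gen("answer", max_tokens=20, beam_width=4)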