Explain the efficiency in docs

Open lucasgadams opened this issue 1 year ago • 5 comments

Is your feature request related to a problem? Please describe.
I don't see anywhere in the docs an explanation of how structured outputs that interleave prompt and generation actually work, or what the efficiency implications are. For instance, take the first example given:

import guidance

# define a guidance program that adapts a proverb
program = guidance("""Tweak this proverb to apply to model instructions instead.

{{proverb}}
- {{book}} {{chapter}}:{{verse}}

UPDATED
Where there is no guidance{{gen 'rewrite' stop="\\n-"}}
- GPT {{gen 'chapter'}}:{{gen 'verse'}}""")

To someone who doesn't know better, this might seem like a standard GPT prompt that gets executed in one pass. But if you dig deeper, you'll see that this example requires 3 GPT calls (makes sense, no other way to do it currently). Every call requires all of the context before it, so this is essentially 3*N in terms of compute time and price. This is a bad idea if you have a very long agent-style prompt. I can imagine someone having a 6k-token prompt that requires a JSON object at the end, putting in their own keys, and requesting 10 filled-in fields. That would be very slow and expensive!
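
A rough sketch of that accounting (the token counts and prices below are made-up illustration values, not real API pricing): with a remote API, each {{gen}} call re-sends the full prefix, so a template with k generation slots pays for the prefix roughly k times.

PROMPT_PRICE_PER_1K = 0.0015      # assumed price, for illustration only
COMPLETION_PRICE_PER_1K = 0.002   # assumed price, for illustration only

def template_cost(prefix_tokens: int, slots: int, tokens_per_slot: int) -> float:
    """Estimate the cost of filling `slots` template fields via separate API calls."""
    total = 0.0
    for i in range(slots):
        # every call pays for the original prefix plus everything generated so far
        context = prefix_tokens + i * tokens_per_slot
        total += context * PROMPT_PRICE_PER_1K / 1000
        total += tokens_per_slot * COMPLETION_PRICE_PER_1K / 1000
    return total

# a 6k-token prompt with 10 fields to fill in, ~30 tokens each:
print(round(template_cost(6000, 10, 30), 4))  # ~0.093, vs ~0.009 of prompt cost for a single call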

Describe the solution you'd like
Better documentation on how everything works under the hood: which models can achieve actual speedups, and for which models structuring prompts like this would be a bad idea.

lucasgadams · May 24 '23 20:05

I have the same question as you. I am also wondering how "gen" works.

  • If each time guidance hits "gen" it calls the LLM and passes in all of the text before that "gen", then it would be really costly.
  • And if so, it seems like a linear process. How does guidance achieve "Guidance Acceleration"? (From the README: This means Guidance only asks the LLM to generate the green text below, not the entire program. This cuts this prompt's runtime in half vs. a standard generation approach.)

Tom-0727 · May 25 '23 02:05

@Tom-0727 my guess is that if you have access to the underlying model (i.e. it's an open-source model run through the Transformers library) you can more or less cache the prompt, and then you get a "speedup" where it becomes something closer to just straight generation. But for API-based models you can't, so this library seems like not such a good idea for those.
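
For reference, here is a minimal sketch of the kind of prompt caching being described, using the Hugging Face Transformers API directly (the model name and prompt text are arbitrary; guidance's own integration is more involved than this):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# run the shared prefix once and keep its key/value cache
prefix = "Tweak this proverb to apply to model instructions instead.\n"
prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
with torch.no_grad():
    prefix_out = model(prefix_ids, use_cache=True)
cache = prefix_out.past_key_values

# a later gen-style call can continue from the cached prefix instead of
# re-encoding it, paying compute only for the new tokens
next_ids = tokenizer("Where there is no guidance", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(next_ids, past_key_values=cache, use_cache=True)
print(out.logits.shape)  # logits for the new tokens only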

lucasgadams · May 25 '23 12:05

Great questions, and we will expand the docs to cover this when we can.

Basically you get the same performance you normally get for API endpoints that don't yet support guidance (and yes, chaining lots of calls to gen is clean and easy, but it comes with the same performance impacts you get with any other library). But for endpoints that do support guidance directly (currently just transformers open models, soon llama.cpp, ...) then we can reuse and smartly batch the KV cache. This is actually much faster than straight generation, since you pay only "prompt token cost" for all the known parts of the prompt (prompt tokens are much faster to process than generation tokens).
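
As a back-of-the-envelope illustration of that last point (the per-token timings below are assumed, not measured): known template text only needs a batched forward pass, while generated text costs one forward pass per token.

PROMPT_MS_PER_TOKEN = 0.5   # assumed: known text is scored in one batched pass
GEN_MS_PER_TOKEN = 30.0     # assumed: sampled text costs one forward pass per token

def runtime_ms(known_tokens: int, generated_tokens: int, accelerated: bool) -> float:
    if accelerated:
        # guidance-style: the known parts of the template are never sampled
        return known_tokens * PROMPT_MS_PER_TOKEN + generated_tokens * GEN_MS_PER_TOKEN
    # standard approach: the model has to emit the boilerplate too
    return (known_tokens + generated_tokens) * GEN_MS_PER_TOKEN

print(runtime_ms(400, 100, accelerated=True))   # 3200.0 ms
print(runtime_ms(400, 100, accelerated=False))  # 15000.0 ms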

If you are using traditional endpoints then you mostly just get nice select forcing, stop_regex support, streaming development in your notebook, etc., not performance acceleration (and be careful about making long chains of calls or you will end up with a performance hit). But looking forward, there is no technical reason why guidance can't help instigate a new, richer set of API endpoints :) ...I'll post in Discussions when the guidance server code lands to show how such endpoints can work.
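
For example, with a traditional OpenAI completion endpoint you can still write something like the sketch below (written against the guidance API roughly as of this release; argument names may differ in later versions, and the model name is just a placeholder):

import guidance

guidance.llm = guidance.llms.OpenAI("text-davinci-003")

program = guidance("""Is the following sentence polite or impolite?
Sentence: {{sentence}}
Answer: {{#select 'label'}}polite{{or}}impolite{{/select}}
Reason: {{gen 'reason' stop_regex="\\n"}}""")

out = program(sentence="Would you kindly pass the salt?")
print(out["label"], out["reason"])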

slundberg · May 25 '23 19:05

Do you have some inside knowledge that API-based LLMs will support something like this in the future? That would definitely change prompting techniques. While there is no technical reason why it couldn't be done, there may be practical reasons why OpenAI, for instance, never supports this. But I'd be very interested to understand whether this is the direction they will likely take.

lucasgadams · May 25 '23 19:05

@lucasgadams good question, I was implying nothing based on inside knowledge. Just based on usefulness.

slundberg · May 25 '23 20:05

Do you have some inside knowledge that API-based LLMs will support something like this in the future? That would definitely change prompting techniques. While there is no technical reason why it couldn't be done, there may be practical reasons why OpenAI, for instance, never supports this. But I'd be very interested to understand whether this is the direction they will likely take.

Technically, I think it's because the API backend computes all prompt requests in bulk simultaneously, making it difficult to implement pauses and interruptions in between. But OpenAI would definitely charge less for doing this...

luo-li-ba-suo · Jul 12 '23 07:07

But for endpoints that do support guidance directly (currently just transformers open models, soon llama.cpp, ...) then we can reuse and smartly batch the KV cache.

@slundberg could you elaborate on what "support guidance directly" means?

hyusetiawan · Sep 27 '23 00:09

In the new release, we try to explain this here. Hopefully it's clearer now; if not, please reopen :)

marcotcr · Nov 14 '23 21:11