Question about batched inference and async_mode
Please forgive me if I am asking a stupid question. I am trying to run inference over a lot of data, and I think batched inference would give a big speed-up. Since the demo here only seems to show single-sample inference, I modified the code a bit to support batched inference as follows:
import time
import torch
import guidance
import asyncio
llm = guidance.llms.Transformers("gpt2-large", device="cuda")
template = """{{prefix}}{{gen 'story' stop="." max_tokens=50}}."""
program = guidance(template, llm=llm, async_mode=True)
prompt = "{num} ducks are crossing the bridge, suddenly "
nums = [str(i) for i in range(100)]
prefixes = [prompt.format(num=num) for num in nums]
## The following is the batched-inference part
loop = asyncio.new_event_loop()

async def call_async(prefix):
    await program(prefix=prefix)

tasks = []
for prefix in prefixes:
    tasks.append(loop.create_task(call_async(prefix=prefix)))
results = loop.run_until_complete(asyncio.wait(tasks, timeout=500))
However, the inference time is almost the same as a naive loop over the list that infers one sample at a time:
## The following is the naive single-sample inference
results = []
for prefix in prefixes:
    results.append(program(prefix=prefix))
Am I misunderstanding something? Why doesn't asynchronous batched inference give any speed-up?
Your parallelization should happen at the GPU level, not the system level. The GPU is blocked until it finishes each inference, so scheduling more async tasks does not help. You would need to change the guidance function so that it calls generate with a batch of data.
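For illustration, here is a minimal sketch of what GPU-level batching looks like with plain transformers, bypassing guidance entirely. The model name and the 50-token limit mirror the example above; the batch size of 16 is my own assumption, and the stop="." behaviour from the guidance template is not replicated here.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Plain transformers batching: all prompts in a batch go through one generate() call.
tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no pad token by default
tokenizer.padding_side = "left"             # pad on the left for decoder-only generation

model = AutoModelForCausalLM.from_pretrained("gpt2-large").to("cuda")

prompt = "{num} ducks are crossing the bridge, suddenly "
prefixes = [prompt.format(num=i) for i in range(100)]

results = []
batch_size = 16  # assumed value; tune to fit your GPU memory
for start in range(0, len(prefixes), batch_size):
    batch = prefixes[start:start + batch_size]
    inputs = tokenizer(batch, return_tensors="pt", padding=True).to("cuda")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=50,
            pad_token_id=tokenizer.eos_token_id,
        )
    results.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))

This gives the GPU a full batch per forward pass, which is where the speed-up comes from; guidance's templating and stop-sequence handling would have to be layered back on top.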
Does anybody have an example of this with this library? I am very keen to get better GPU utilisation.
I need to understand this as well if it is possible.
@lachlancahill Were you able to find this out?
As I mentioned before, guidance cannot do this at the moment; you would need to rebuild a lot of guidance's source code to allow batched inference. My suggestion is to use: https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md
PS: There is also a discussion about this in #493.
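For reference, here is a minimal sketch of grammar-constrained generation through the llama-cpp-python bindings (this shows the grammar mechanism, not batching; the model path is a placeholder and argument names may differ between versions):

from llama_cpp import Llama, LlamaGrammar

# GBNF grammar: restrict the model's output to "yes" or "no".
grammar = LlamaGrammar.from_string('root ::= "yes" | "no"')

# model_path is a placeholder; point it at a local GGUF model file.
llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf")

out = llm("Are ducks birds? Answer yes or no: ", grammar=grammar, max_tokens=5)
print(out["choices"][0]["text"])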
@lachlancahill Were you able to find this out?
Not with this library, but the outlines library does support batch inference and I am seeing great results with it. I also find it more intuitive to use (e.g. guidance's strange use of add() as its main inference method).
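For anyone landing here later, a minimal sketch of batched generation with outlines, based on its 0.x API (exact function and argument names may have changed in newer releases):

import outlines

prompt = "{num} ducks are crossing the bridge, suddenly "
prefixes = [prompt.format(num=i) for i in range(100)]

model = outlines.models.transformers("gpt2-large", device="cuda")
generator = outlines.generate.text(model)

# Passing a list of prompts runs them through the model as a batch.
stories = generator(prefixes, max_tokens=50)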