
Question about batched inference and async_mode

Open HuihuiChyan opened this issue 1 year ago • 6 comments

Please forgive me if I am asking a stupid question. I am trying to run inference on a lot of data, and I think batched inference would speed things up a lot. Since the demo here only seems to show how to perform single-sample inference, I modified the code a bit to support batched inference as follows:

import asyncio

import guidance

# Load the model once and build a guidance program in async mode.
llm = guidance.llms.Transformers("gpt2-large", device="cuda")
template = """{{prefix}}{{gen 'story' stop="." max_tokens=50}}."""
program = guidance(template, llm=llm, async_mode=True)

# Build 100 prompts that differ only in the number of ducks.
prompt = "{num} ducks are crossing the bridge, suddenly "
prefixes = [prompt.format(num=i) for i in range(100)]

## The following is the batched (asynchronous) inference part
loop = asyncio.new_event_loop()

async def call_async(prefix):
    # Run the program for a single prefix and return the executed program.
    return await program(prefix=prefix)

# Schedule one task per prefix so they can (in theory) run concurrently.
tasks = [loop.create_task(call_async(prefix=prefix)) for prefix in prefixes]

done, _pending = loop.run_until_complete(asyncio.wait(tasks, timeout=500))
results = [task.result() for task in done]

However, the inference time is almost the same as performing a naive loop over the list and inferring one sample at a time:

## The following is the naive single-sample inference loop
results = []
for prefix in prefixes:
    results.append(program(prefix=prefix))

Am I misunderstanding something? Why doesn't asynchronous batched inference give any speed-up?

HuihuiChyan avatar Aug 31 '23 07:08 HuihuiChyan

Your parallelization needs to happen at the GPU level, not the system level. The GPU is blocked until it finishes each inference, so scheduling the calls concurrently on the CPU does not help. You would need to change the guidance function so that it calls generate with a batch of data.
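For illustration, here is a minimal sketch of what GPU-level batching can look like when you call the underlying transformers model directly. Note that this bypasses guidance entirely (so the stop="." constraint from the template is not enforced), and the batch size is just an arbitrary choice:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-large")
tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no pad token by default
tokenizer.padding_side = "left"             # decoder-only models need left padding for generate
model = AutoModelForCausalLM.from_pretrained("gpt2-large").to("cuda")

prompt = "{num} ducks are crossing the bridge, suddenly "
prefixes = [prompt.format(num=i) for i in range(100)]

results = []
batch_size = 16  # arbitrary; tune to fit your GPU memory
for start in range(0, len(prefixes), batch_size):
    batch = prefixes[start:start + batch_size]
    # Tokenize and pad the whole batch so it forms a single tensor on the GPU.
    inputs = tokenizer(batch, return_tensors="pt", padding=True).to("cuda")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=50,
            pad_token_id=tokenizer.eos_token_id,
        )
    results.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))

All sequences in a batch go through the model in a single forward pass per decoding step, which is where the speed-up comes from; concurrent Python tasks still serialize on the single GPU.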

jadermcs avatar Oct 19 '23 11:10 jadermcs

> Your parallelization needs to happen at the GPU level, not the system level. The GPU is blocked until it finishes each inference, so scheduling the calls concurrently on the CPU does not help. You would need to change the guidance function so that it calls generate with a batch of data.

Does anybody have an example of this with this library? I am very keen to get better GPU utilisation.

lachlancahill avatar Dec 02 '23 07:12 lachlancahill

I need to understand this as well if it is possible.

drachs avatar Dec 10 '23 07:12 drachs

@lachlancahill Were you able to find this out?

prats0599 avatar Feb 20 '24 19:02 prats0599

As I mentioned before, guidance cannot do this at the moment; you would need to rebuild a lot of guidance's source code to allow batched inference. My suggestion is to use: https://github.com/ggerganov/llama.cpp/blob/master/grammars/README.md

PS: There is also a discussion about this in #493.
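If you go the llama.cpp route, a rough sketch using the llama-cpp-python bindings might look like the following (the model path and the grammar are placeholders; see the grammars README linked above for the GBNF syntax):

from llama_cpp import Llama, LlamaGrammar

# Placeholder model path; any GGUF model works here.
llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf")

# A tiny GBNF grammar that constrains the output to "yes" or "no".
grammar = LlamaGrammar.from_string('root ::= "yes" | "no"')

out = llm(
    "Are ducks birds? Answer yes or no: ",
    grammar=grammar,
    max_tokens=8,
)
print(out["choices"][0]["text"])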

jadermcs avatar Feb 21 '24 17:02 jadermcs

> @lachlancahill Were you able to find this out?

Not with this library, but the outlines library does support batch inference and I am seeing great results with it. I also find it more intuitive to use (e.g., guidance's strange use of add() as its main inference method).
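For anyone curious, here is a rough sketch of batched generation with outlines, based on the API at the time I used it; the function names may have changed in newer releases, so treat them as assumptions and check the current docs:

import outlines

# Model name and device are placeholders; any transformers model should work.
model = outlines.models.transformers("gpt2-large", device="cuda")
generator = outlines.generate.text(model)

prompt = "{num} ducks are crossing the bridge, suddenly "
prefixes = [prompt.format(num=i) for i in range(100)]

# Passing a list of prompts runs them as a single batch on the GPU.
stories = generator(prefixes, max_tokens=50, stop_at=".")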

lachlancahill avatar Feb 23 '24 11:02 lachlancahill