Batched Inference to Improve GPU Utilisation

Open lachlancahill opened this issue 1 year ago • 7 comments

Is your feature request related to a problem? Please describe. When using this library in a loop, I am getting poor GPU Utilisation running zephyr-7b.

Describe the solution you'd like It would be fantastic to be able to pass a list of prompts to a function of the Transformers class, and define a batch size like you can for a huggingface pipeline. This significantly improves speed and GPU utilisation.
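
For reference, this is roughly the pipeline-style batching I have in mind (a sketch only; the model ID and prompts below are placeholders, not part of the request itself):

```python
from transformers import pipeline

# A Hugging Face pipeline accepts a list of prompts plus a batch_size,
# and batches them onto the GPU internally.
generator = pipeline(
    "text-generation",
    model="HuggingFaceH4/zephyr-7b-beta",  # example model ID
    device=0,
)
generator.tokenizer.pad_token_id = generator.tokenizer.eos_token_id  # needed for padded batches

prompts = [f"Summarise document {i}." for i in range(32)]
outputs = generator(prompts, batch_size=8, max_new_tokens=64)
```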

Additional context GPU utilisation for reference: [GPU utilisation screenshot]

lachlancahill avatar Dec 02 '23 07:12 lachlancahill

I feel like this is very important: if batch inference isn't implemented, I can't really consider this library over llama.cpp's GBNF grammars.

drachs avatar Dec 10 '23 06:12 drachs

+1

@drachs does GBNF in ggml support batched inference with different grammar constraints per generation in the batch? Is that even possible? Would love some guidance, if you please.

darrenangle avatar Dec 13 '23 21:12 darrenangle

I'm not very strong on the theory, but llama.cpp does support continuous batching with a grammar file. It has had grammar support and continuous batching support for a while, but my understanding is that the two didn't work together until this PR; there may be some clues in there: https://github.com/ggerganov/llama.cpp/pull/3624

You can try it out yourself; here are some instructions from my notes on how to use this with Docker. Note that the version in the public Docker images doesn't work; I assume they were published prior to the fix in October.

```shell
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
docker build -t local/llama.cpp:full-cuda -f .devops/full-cuda.Dockerfile .
docker run --gpus all -v .:/models -it --entrypoint /bin/bash local/llama.cpp:full-cuda
./parallel -m /models/ --grammar-file grammars/json.gbnf -t 1 -ngl 100 -c 8192 -b 512 -s 1 -np 10 -ns 128 -n 100 -cb
```

drachs avatar Dec 14 '23 06:12 drachs

FWIW I get like 95+% utilization when running inference on Mac Metal (specifically using Mistral-7b, used via repeated ollama REST API queries).

Speculating, I feel like this has something to do with memory bandwidth on the 3090 setup. Not sure though.

Jbollenbacher avatar Dec 19 '23 02:12 Jbollenbacher

Thanks, that's interesting to know.

I think it's unlikely to be a memory bandwidth issue. The 3090 is 90-100% utilised when using the same model via huggingface transformers (with much better throughput).

To speculate myself, I suspect much of the processing done in this library is CPU-bound, so when running in a loop the GPU waits while the CPU-bound processing runs, and then the CPU waits while GPU inference runs. That's why batch inference would be great: while the CPU is processing the output of the first item, the GPU can begin running inference on the next, so the two aren't waiting on each other and can work at the same time.
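
To illustrate the loop-versus-batch difference with plain transformers, here's a rough sketch (the model ID and prompts are placeholders; this is not guidance's API, just the batching behaviour I'd like it to expose):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model ID for illustration.
model_id = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # left-pad so generation continues from the prompt end
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

prompts = [f"Classify review {i}:" for i in range(16)]

# Looping leaves the GPU idle while the CPU tokenises and decodes each item:
# for p in prompts:
#     inputs = tokenizer(p, return_tensors="pt").to("cuda")
#     model.generate(**inputs, max_new_tokens=32)

# One padded batch keeps the GPU busy for the whole forward pass instead.
batch = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(**batch, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```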

lachlancahill avatar Dec 21 '23 22:12 lachlancahill

:+1: Batch inference would be a big unlock for synthetic data generation.

edit: in the meantime, outlines offers constrained generation and batch inference.

freckletonj avatar Jan 07 '24 21:01 freckletonj

Any idea on how to perform batch inference? This matters especially when applying guidance to many inputs in parallel.

CarloNicolini avatar Apr 03 '24 12:04 CarloNicolini