
Provide an HTTP API layer

Open · acidbubbles opened this issue 2 years ago · 3 comments

This project looks amazing, and I'm very interested in trying it out in my projects. However, I usually run models remotely, which means I need a way to integrate my application with it without running everything on the same machine.

For this purpose, an HTTP API is extremely practical, since I can host it in a Docker image on RunPod, for example, or on a dedicated machine.

Since the API of Guidance seems straightforward, a websocket implementation would be a great match, since there might be back and forth between prompt generation and user input (obviously with major implications, as the inference would hold the server's VRAM while waiting, but in some cases that's better than feeding back the whole prompt every time).
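For illustration only, a stateful session could look something like this with FastAPI (just one possible framework); `generate` here is a stand-in for whatever guidance call produces the next completion, not a real guidance function:

```python
# Sketch only: FastAPI is one possible framework, and generate() is a
# placeholder for whatever guidance call produces the next completion.
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()


def generate(state: str) -> str:
    # Placeholder: run a guidance program against the accumulated state
    # and return the newly generated text.
    return " [model output] "


@app.websocket("/session")
async def session(ws: WebSocket):
    await ws.accept()
    state = ""  # prompt/conversation state stays on the server between turns
    try:
        while True:
            user_input = await ws.receive_text()  # wait for the next user turn
            state += user_input
            completion = generate(state)
            state += completion
            await ws.send_text(completion)
    except WebSocketDisconnect:
        pass  # client closed the session; state (and VRAM) can be released
```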

In a perfect world we'd get it all, but any of those options would be helpful:

  1. A simple HTTP layer to load a model and run inferences (no support for await); a rough sketch follows this list
  2. A websocket API (support for await)
  3. A docker image that has all of this already configured and ready to go, so users can try guidance easily locally or on a GPU farm.
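To make option 1 concrete, here is a rough sketch assuming FastAPI and guidance's Transformers backend; the model name is a placeholder and the guidance calls follow the newer `guidance.models` API, so they may need adjusting for your version:

```python
# Rough sketch of option 1; the model name is a placeholder and the guidance
# calls follow the newer guidance.models API, so adjust for your version.
from fastapi import FastAPI
from pydantic import BaseModel
from guidance import models, gen

app = FastAPI()
lm = models.Transformers("your-org/your-model")  # loaded once at startup


class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 128


@app.post("/generate")
def generate(req: GenerateRequest):
    # One self-contained guidance run per request; no await / back-and-forth.
    out = lm + req.prompt + gen("completion", max_tokens=req.max_tokens)
    return {"completion": out["completion"]}
```

Served with something like `uvicorn server:app`, that would already cover remote inference; the websocket variant sketched above is what option 2 would add.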

Note that I have not tried guidance yet because of the complexity of setting it up locally (options like text-generation-web-ui and kobold are great for quick experiments because they bundle the API with the model loading / inference, and have Docker images available).

Thanks for your consideration!

acidbubbles · Jul 10 '23 00:07

Just wrote the Python + spent way too long battling Docker and Python dependencies to get this working.

It's far from fully featured, but it suits our use case well enough for the moment.

If you have nvidia-container-toolkit, it runs out of the box.

https://github.com/utilityai/guidance-rpc
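(For anyone trying it: with the NVIDIA Container Toolkit installed, starting a GPU-enabled container typically looks along these lines; the image name and port are placeholders, so check the repo's README for the real values.)

```
# Image name and port are placeholders; see the guidance-rpc README.
docker run --gpus all -p 8000:8000 <guidance-rpc-image>
```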

MarcusDunn · Aug 22 '23 20:08

@MarcusDunn the link is not working. Can you please re-upload the docker image?

muaid-mughrabi · Mar 28 '24 15:03

Unfortunately, that's no longer maintained (I just archived it and made it public again), and I would highly recommend not using it. There was a period of time when guidance was not maintained, during which we built our own (closed-source) inference server on top of https://github.com/utilityai/llama-cpp-rs, which lets us do what we need.

MarcusDunn · Mar 28 '24 18:03