Improved initial Outlines experience
There are a few areas where the initial experience of using Outlines is a little clunky.
Installing Outlines
For one, none of the inference backends are dependencies, which means people are usually confronted by a series of "please install X", "please install Y", etc.
Eric Ma has a Gist about this, which highlights a series of mild irritations as the user's very first interactions with Outlines. Ideally, there should be a set of install options that provide canned or default options for users.
My proposal is to add `llama-cpp-python` as a default install, and provide a series of optional dependencies by inference engine:
```shell
# prepackaged with llama.cpp
pip install outlines

# one extra for each engine
pip install 'outlines[vllm]'
pip install 'outlines[transformers]'

# gives you everything
pip install 'outlines[all]'
```
EDIT: we already support the `serve` extra, so it should be relatively straightforward to include other methods.
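Concretely, the extras could be declared in `pyproject.toml` along these lines. This is just a sketch: the dependency lists are placeholders rather than what each backend actually requires, and the self-referencing `all` extra assumes a reasonably recent pip.

```toml
[project.optional-dependencies]
# Placeholder dependency lists; the real ones would need
# version pins matching what each backend requires.
vllm = ["vllm"]
transformers = ["transformers", "torch"]
llamacpp = ["llama-cpp-python"]
# Self-referencing extra: pulls in every backend above.
all = ["outlines[vllm,transformers,llamacpp]"]
```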
Docs
The docs should also be updated to include examples for a handful of different backends in tabs, so that people can enter their preferred inference mode and have all the docs reflect their choice.
It may also be possible to let users select a model and preferred quantization method, so they can always copy/paste code for whatever mode they're working in.
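If the docs are built with mkdocs-material (which supports content tabs via the `pymdownx.tabbed` extension), the per-backend examples could look something like the following. The model names and call signatures inside the tabs are illustrative, not a recommendation:

````markdown
=== "transformers"

    ```python
    import outlines
    model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
    ```

=== "llama.cpp"

    ```python
    import outlines
    model = outlines.models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
    ```
````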
Models to use and quantization
There are lots of annoying little edge cases with models. Some of the docs still refer to gated models. These are generally pretty reliable, but they require you to set an HF token, which many users may not have done.
We should make it very clear which models are recommended as intro tools, and show clearly how to quantize them for resource-constrained users.
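As a sketch of what the resource-constrained path could look like: transformers can load a model 4-bit quantized via its bitsandbytes integration, passed through Outlines' `model_kwargs`. The model name below is only an example, not an official recommendation, and the actual call needs a GPU plus the relevant extras installed, so it is left commented out.

```python
# Kwargs for a 4-bit quantized load via transformers' bitsandbytes
# integration (requires the `bitsandbytes` package at runtime).
quantized_model_kwargs = {
    "device_map": "auto",
    "load_in_4bit": True,
}

# The actual call (example model name; needs a GPU and the extras):
# import outlines
# model = outlines.models.transformers(
#     "microsoft/Phi-3-mini-4k-instruct",
#     model_kwargs=quantized_model_kwargs,
# )
```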
Any other things I missed?
I agree on all points. You can split this in smaller issues, and people can keep sharing their grievances here :)
Good issue, I've run into all of these problems.
I disagree about `llama.cpp` though; there's no reason to include `llama-cpp-python` by default in downstream dependents such as vLLM. Additionally, `llama-cpp-python` is the Outlines-supported backend I've run into the most issues installing / building.
We should prominently document that users can install `outlines[llamacpp]`, `outlines[transformers]`, etc., rather than recommending `pip install outlines`.
> I disagree about `llama.cpp` though; there's no reason to include `llama-cpp-python` by default in downstream dependents such as vLLM.
>
> We should prominently document that users can install `outlines[llamacpp]`, `outlines[transformers]`, etc., rather than recommending `pip install outlines`.
This is fine with me!
> Additionally, `llama-cpp-python` is the Outlines-supported backend I've run into the most issues installing / building.
I've had this happen as well. Lots of tiny issues with llamacpp. This is a bigger issue IMO -- llamacpp support should definitely be more robust to support the various hardware-constrained Outlines users.
As long as major inference libraries import Outlines, we shouldn't include any library by default. However, when the structured generation logic is in a separate library, I am ready to consider having llama.cpp as a default.
Coming from a pure `pip install outlines` (it didn't prompt me to install anything else), it took 1hr+ to generate 512 tokens constrained to a ``r"```latex(.*?|\n)```"`` regex. The FSM compiled to 100% fairly fast, though. This was a 2B 4-bit model on a 3090, and all 24GB of VRAM were filled during generation (my prompt is ~20 tokens).
I had a similar experience in the past, which I "solved" by using HuggingFace's TGI for structured generation. It was a lot faster, which is weird because I thought they used Outlines under the hood.
Should I have gone with an inference engine, like `pip install 'outlines[vllm]'`?
@ahmed-moubtahij Yes, outlines only becomes the bottleneck after ~1,000 tokens/s, and vllm is substantially faster than transformers
However, are you sure you set device_map="cuda"? Sounds like you might have been generating on CPU.
@lapp0 Yeah, I already had `device_map` set to `"cuda"`; the card's entire VRAM was occupied.
This is what my code looks like:
```python
model = outlines.models.transformers(
    "unsloth/gemma-2-2b-it-bnb-4bit",
    tokenizer_kwargs={"trust_remote_code": True},
    model_kwargs={
        "torch_dtype": torch.float16,
        "device_map": "cuda",
        "attn_implementation": "flash_attention_2",
    },
)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_special_tokens=False)
generator = outlines.generate.regex(model, r"<latex>(.|\n)*?</latex>")
response = generator(prompt, max_tokens=1024)
```
I'll re-try with `pip install 'outlines[vllm]'`.
UPDATE: Silly me, it didn't automagically speed up things, I need to actually launch a vllm endpoint. For now, I'll just work around this problem.
Do you mind moving the resolution of this problem to a new issue? 🙂
> Do you mind moving the resolution of this problem to a new issue? 🙂
Sure, how do I accomplish that? I thought of referencing my messages in a new issue and deleting from here but I think that would cancel the reference.
You don't have to delete the messages, we'll just hide them here!
Done: https://github.com/dottxt-ai/outlines/issues/1167
+1 to this - I shouldn't need to install torch if I'm just using Outlines with the mlx backend. It seems like this is partially implemented (e.g., deferred importing / catching import errors here), but the current state is that you can't even `import outlines`, because other modules don't follow this pattern.
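For reference, a minimal sketch of the deferred-import pattern being described, where a backend is only imported at the point of use and the error message names the extra to install. `require` is a hypothetical helper, not an actual Outlines function:

```python
import importlib


def require(module_name: str, extra: str):
    """Import a backend module lazily, raising a helpful error if it's missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        raise ImportError(
            f"`{module_name}` is required for this backend. "
            f"Install it with: pip install 'outlines[{extra}]'"
        ) from e
```

With this pattern, `import outlines` itself never pulls in torch, mlx, or any other backend; the first call that needs one either succeeds or tells the user exactly which extra to install.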