Improved initial Outlines experience
There are a few areas where the initial experience of using Outlines is a little clunky.
Installing Outlines
For one, none of the inference backends are dependencies, which means people are usually confronted by a series of "please install X", "please install Y", etc.
Eric Ma has a Gist about this, which highlights a series of mild irritations as the user's very first interactions with Outlines. Ideally, there should be a set of install options that provide canned or default options for users.
My proposal is to add `llama-cpp-python` as a default install, and provide a series of optional dependencies by inference engine:
```shell
# prepackaged with llama.cpp
pip install outlines

# one extra for each engine
pip install 'outlines[vllm]'
pip install 'outlines[transformers]'

# gives you everything
pip install 'outlines[all]'
```
EDIT: we already support the `serve` extra, so it should be relatively straightforward to include other methods.
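Concretely, the extras could be declared in `pyproject.toml` along these lines. This is just a sketch: the dependency lists are placeholders rather than what each backend actually requires, and the self-referencing `all` extra assumes a reasonably recent pip.

```toml
[project.optional-dependencies]
# Placeholder dependency lists; the real ones would need
# version pins matching what each backend requires.
vllm = ["vllm"]
transformers = ["transformers", "torch"]
llamacpp = ["llama-cpp-python"]
# Self-referencing extra: pulls in every backend above.
all = ["outlines[vllm,transformers,llamacpp]"]
```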
Docs
The docs should also be updated to include examples for a handful of different backends in tabs, so that people can enter their preferred inference mode and have all the docs reflect their choice.
It may also be possible to let users select a model and preferred quantization method, so they can always copy/paste code for whatever mode they're working in.
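If the docs are built with mkdocs-material (which supports content tabs via the `pymdownx.tabbed` extension), the per-backend examples could look something like the following. The model names and call signatures inside the tabs are illustrative, not a recommendation:

````markdown
=== "transformers"

    ```python
    import outlines
    model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
    ```

=== "llama.cpp"

    ```python
    import outlines
    model = outlines.models.llamacpp("TheBloke/phi-2-GGUF", "phi-2.Q4_K_M.gguf")
    ```
````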
Models to use and quantization
There are lots of annoying little edge cases with models. Some of the docs still refer to gated models. These are generally pretty reliable, but they require you to set an HF token, which many users may not have done.
We should make it very clear which models are recommended as intro tools, and show clearly how to quantize them for resource-constrained users.
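As a sketch of what the resource-constrained path could look like: transformers can load a model 4-bit quantized via its bitsandbytes integration, passed through Outlines' `model_kwargs`. The model name below is only an example, not an official recommendation, and the actual call needs a GPU plus the relevant extras installed, so it is left commented out.

```python
# Kwargs for a 4-bit quantized load via transformers' bitsandbytes
# integration (requires the `bitsandbytes` package at runtime).
quantized_model_kwargs = {
    "device_map": "auto",
    "load_in_4bit": True,
}

# The actual call (example model name; needs a GPU and the extras):
# import outlines
# model = outlines.models.transformers(
#     "microsoft/Phi-3-mini-4k-instruct",
#     model_kwargs=quantized_model_kwargs,
# )
```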
Any other things I missed?
I agree on all points. You can split this in smaller issues, and people can keep sharing their grievances here :)
Good issue, I've run into all of these problems.
I disagree about `llama.cpp` though; there's no reason to include `llama-cpp-python` by default in downstream dependents such as vLLM. Additionally, `llama-cpp-python` is the Outlines-supported backend I've run into the most issues installing / building.
We should prominently document that users can install `outlines[llamacpp]`, `outlines[transformers]`, etc., rather than recommending `pip install outlines`.
> I disagree about `llama.cpp` though; there's no reason to include `llama-cpp-python` by default in downstream dependents such as vLLM.
>
> We should prominently document that users can install `outlines[llamacpp]`, `outlines[transformers]`, etc., rather than recommending `pip install outlines`.
This is fine with me!
> Additionally, `llama-cpp-python` is the Outlines-supported backend I've run into the most issues installing / building.
I've had this happen as well. Lots of tiny issues with llamacpp. This is a bigger issue IMO -- llamacpp support should definitely be more robust to support the various hardware-constrained Outlines users.
As long as major inference libraries import Outlines, we shouldn't include any library by default. However, when the structured generation logic is in a separate library, I am ready to consider having llama.cpp as a default.
Coming from a pure `pip install outlines` (it didn't prompt me to install anything else), it took 1hr+ to generate 512 tokens constrained to a ``r"```latex(.*?|\n)```"`` regex. The FSM compiled to 100% fairly fast, though. This was a 2B 4-bit model on a 3090, and all 24GB of VRAM were filled during generation (my prompt is ~20 tokens).
I had a similar experience in the past, which I "solved" by using HuggingFace's TGI for structured generation. It was a lot faster, which is weird because I thought they used Outlines under the hood.
Should I have gone with an inference engine, like `pip install 'outlines[vllm]'`?
@ahmed-moubtahij Yes, outlines only becomes the bottleneck after ~1,000 tokens/s, and vllm is substantially faster than transformers
However, are you sure you set device_map="cuda"? Sounds like you might have been generating on CPU.
@lapp0 Yeah, I already had `device_map` set to `"cuda"`; the card's entire VRAM was occupied.
This is what my code looks like:
```python
model = outlines.models.transformers(
    "unsloth/gemma-2-2b-it-bnb-4bit",
    tokenizer_kwargs={"trust_remote_code": True},
    model_kwargs={
        "torch_dtype": torch.float16,
        "device_map": "cuda",
        "attn_implementation": "flash_attention_2",
    },
)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_special_tokens=False)
generator = outlines.generate.regex(model, r"<latex>(.|\n)*?</latex>")
response = generator(prompt, max_tokens=1024)
```
I'll re-try with `pip install 'outlines[vllm]'`.
UPDATE: Silly me, it didn't automagically speed up things, I need to actually launch a vllm endpoint. For now, I'll just work around this problem.
Do you mind moving the resolution of this problem to a new issue? 🙂
> Do you mind moving the resolution of this problem to a new issue? 🙂
Sure, how do I accomplish that? I thought of referencing my messages in a new issue and deleting from here but I think that would cancel the reference.
You don't have to delete the messages, we'll just hide them here!
Done: https://github.com/dottxt-ai/outlines/issues/1167
+1 to this - I shouldn't need to install torch if I'm just using Outlines with the mlx backend. It seems like this is partially implemented (e.g., deferred importing / catching import errors here), but the current state is that you can't even `import outlines`, because other modules don't follow this pattern.
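For reference, a minimal sketch of the deferred-import pattern being described, where a backend is only imported at the point of use and the error message names the extra to install. `require` is a hypothetical helper, not an actual Outlines function:

```python
import importlib


def require(module_name: str, extra: str):
    """Import a backend module lazily, raising a helpful error if it's missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        raise ImportError(
            f"`{module_name}` is required for this backend. "
            f"Install it with: pip install 'outlines[{extra}]'"
        ) from e
```

With this pattern, `import outlines` itself never pulls in torch, mlx, or any other backend; the first call that needs one either succeeds or tells the user exactly which extra to install.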