Outlines v1 response model validation

Open cpfiffer opened this issue 8 months ago • 4 comments

In Outlines v1, we specify an output format with model(prompt, OutputClass, ...).

Currently, this call returns a JSON string rather than a validated instance of OutputClass.

Example code:

import json
from outlines import models
from pydantic import BaseModel, Field
from transformers import AutoModelForCausalLM, AutoTokenizer

class Person(BaseModel):
    name: str
    age: int
    email: str = Field(pattern=r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')


model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
model = models.from_transformers(
    AutoModelForCausalLM.from_pretrained(model_id),
    AutoTokenizer.from_pretrained(model_id)
)

person_text = """
John Doe
30
[email protected]
"""

result = model(
    f"Extract the person information from this text:\n{person_text}", 
    Person,
    max_new_tokens=100
)
# result is a JSON string, so decode it before pretty-printing
print(json.dumps(json.loads(result), indent=2))

The output type is a JSON string:

{ "name": "John Doe", "age": 30, "email": "[email protected]" }

My expectation here would be that model returns a Person object, rather than the raw string. This would look like

Person(name='John Doe', age=30, email='[email protected]')

To work around this, I currently have to add the (very simple) extra line

person = Person.model_validate_json(result)

Is this intended behavior? Outlines < 1.0 would typically return a Pydantic object back.

cpfiffer, Apr 08 '25 19:04

AFAIR, this was not intended but an oversight on our end. Since the output type is available where the text is generated, it should not be too difficult to add, except maybe for union types. We could add it just for Pydantic output types and other simple types for now.

rlouf, Apr 15 '25 20:04

I was aware of it and thought it was intended.

I'm not too keen on putting it back, as I believe it paradoxically ends up making the user's life harder. Users would have to know the return format associated with each output type, and some of those are not fully intuitive (int returns an int, but the regex for an int returns a string). It also may not be implemented in some cases (Union, but also things like booleans in an enum or Literal, and probably other cases that are hard to anticipate).

As a user, I prefer being told I'll always get a string and converting it back into what I need myself (it typically requires a single line).
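
A minimal sketch of that single-line conversion, using only the standard library. The raw strings and variable names below are hypothetical stand-ins, not actual Outlines output:

```python
import json

# Hypothetical raw strings, standing in for what a generator might return.
raw_int = "30"                                 # e.g. constrained by an integer regex
raw_json = '{"name": "John Doe", "age": 30}'   # e.g. constrained by a JSON schema

age = int(raw_int)              # str -> int
record = json.loads(raw_json)   # str -> dict

print(age + 1, record["name"])
```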

RobinPicard, Apr 16 '25 10:04

I think you're right about the fact that it would get confusing.

rlouf, Apr 16 '25 11:04

I think we should treat passing in a Pydantic class (and possibly dict) as a special case. Here, users have explicitly provided an output type. I agree that any other situation should return a string.

The reason for this is that a large portion of calls to Outlines use Pydantic classes. We want to make that as seamless as possible without requiring the user to understand model_validate_json and use it after every single call to the generator.

I don't think this is confusing behavior at all -- from my perspective, it's entirely intuitive to provide an object class and then get that same object class out.

Similarly, if I provide a dict, I would expect a dict. If I provide a string (regex, raw schema), I would expect a string. If anything, it's unintuitive to have all requests cast result types to strings.
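
The mapping described here (Pydantic class in, object out; dict in, dict out; string in, string out) could be sketched as a small post-processing helper. `parse_output` is a hypothetical name, not part of Outlines, and the sketch assumes Pydantic v2 models can be recognized by duck typing on `model_validate_json`:

```python
import json
from typing import Any


def parse_output(raw: str, output_type: Any) -> Any:
    """Convert a raw generation string according to the requested output type."""
    # Pydantic v2 models expose a `model_validate_json` classmethod.
    validate = getattr(output_type, "model_validate_json", None)
    if callable(validate):
        return validate(raw)
    # A plain `dict` output type is decoded into a Python dict.
    if output_type is dict:
        return json.loads(raw)
    # Regex strings, raw JSON schemas, and anything else stay as text.
    return raw


print(parse_output('{"age": 30}', dict))
print(parse_output("30", r"[0-9]+"))
```

With the Person class from the original post, `parse_output(result, Person)` would return a validated Person instance, while string-based output types would still come back as strings.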

cpfiffer, Apr 16 '25 16:04