outlines syntax error when using Optional[Tuple[int,int]]

Describe the issue as clearly as possible:

When using the outlines library to generate JSON structures with the google/gemma-1.1-2b-it and solidrust/Hermes-2-Pro-Llama-3-8B-AWQ models, the JSON output is often invalid, leading to failures in downstream processes. This issue is frequently associated with InvalidSyntax errors during regex pattern parsing.

Steps/code to reproduce the bug:

Initialize the Sampling Parameters and Load the Model:

from vllm.sampling_params import SamplingParams
from outlines import generate
from outlines.models.vllm import VLLM
from huggingface_hub import snapshot_download
from vllm import LLM
from tempfile import TemporaryDirectory

# Set the sampling parameters for the model
samplimg_params = SamplingParams(max_tokens=2048)

# Define token and model path
token = "your_hf_token"

# Load google/gemma-1.1-2b-it model
model_name = "google/gemma-1.1-2b-it"
with TemporaryDirectory() as tmp_dir:
    snapshot_download(repo_id=model_name, local_dir=tmp_dir, token=token)
    model_gemma = LLM(model=tmp_dir, tokenizer=tmp_dir, trust_remote_code=True)

# Load solidrust/Hermes-2-Pro-Llama-3-8B-AWQ model
model_name = "solidrust/Hermes-2-Pro-Llama-3-8B-AWQ"
with TemporaryDirectory() as tmp_dir:
    snapshot_download(repo_id=model_name, local_dir=tmp_dir, token=token)
    model_solidrust = LLM(model=tmp_dir, tokenizer=tmp_dir, trust_remote_code=True, dtype="auto", quantization="awq")

# Wrap the models with VLLM
vllm_model_gemma = VLLM(model_gemma)
vllm_model_solidrust = VLLM(model_solidrust)

Note: This code downloads the model to a local temporary directory and loads it. This simulates uploading to S3 and then downloading it, as the end result is the same.
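One detail of this pattern is worth spelling out: TemporaryDirectory deletes the directory when the with block exits, so anything that reads the files has to run inside the block, as the LLM(...) calls above do. A minimal stdlib-only sketch of that behaviour, where the file name and contents are hypothetical stand-ins for downloaded weights:

```python
from pathlib import Path
from tempfile import TemporaryDirectory

with TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-in for a downloaded model file.
    weights = Path(tmp_dir) / "model.bin"
    weights.write_bytes(b"\x00" * 16)

    # Anything that needs the files must read them here, inside the block.
    loaded = weights.read_bytes()
    existed_inside = weights.exists()

# The directory and its contents are removed once the block exits;
# only what was read into memory survives.
exists_after = weights.exists()
print(existed_inside, exists_after, len(loaded))
```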

Prepare the Prompts:

prompts = [extract_job_description_summary(job['title'], job['description']) for job in df[['title', 'description']].to_dict(orient='records')]

Generate JSON:

generator = generate.json(vllm_model_gemma, JobDescriptionSummary, whitespace_pattern="[ \n\t]?")
results = generator(prompts, sampling_params=samplimg_params)

Expected result:

The model should generate valid JSON objects conforming to the schema defined by JobDescriptionSummary.

Error message:

The output JSON is often invalid, containing syntax errors that prevent proper parsing and downstream processing.
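To quantify how often the output is invalid, the raw strings can be separated into parseable and unparseable ones before they reach downstream code. The following is only a sketch using the standard library: split_valid_json is a hypothetical helper and the sample strings are made up for illustration, not taken from an actual run.

```python
import json


def split_valid_json(raw_outputs):
    """Partition raw strings into parsed objects and (index, error) pairs."""
    parsed, failures = [], []
    for index, text in enumerate(raw_outputs):
        try:
            parsed.append(json.loads(text))
        except json.JSONDecodeError as exc:
            # Record where parsing stopped so the bad output can be inspected.
            failures.append((index, f"{exc.msg} at char {exc.pos}"))
    return parsed, failures


# Hypothetical sample outputs; the second one is truncated on purpose.
samples = [
    '{"salary_range": [1080]}',
    '{"salary_range": [1080',
    '{"salary_range": null}',
]
parsed, failures = split_valid_json(samples)
print(len(parsed), "valid,", len(failures), "invalid")
```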

Error Details

For the google/gemma-1.1-2b-it model:

NoMatch: Can not match at index 4072. Got ')?([ ', expected any of ['*', '+', '?', '{', '(', '[', '\\', '.', '$', '^', "<Any 1 except ('\\\\', '$', '|', '?', '+', '.', '^', ')', '[', '(', '*')>", '|'].
Context(data[-10:+10]): ']?\\]|null))?([ \n\t]?,'

For the solidrust/Hermes-2-Pro-Llama-3-8B-AWQ model:

NoMatch: Can not match at index 4072. Got ')?([ ', expected any of ['*', '+', '?', '{', '(', '[', '\\', '.', '$', '^', "<Any 1 except ('\\\\', '$', '|', '?', '+', '.', '^', ')', '[', '(', '*')>", '|'].
Context(data[-10:+10]): ']?\\]|null))?([ \n\t]?,'

Outlines/Python version information:

Version information

* python: 3.10.4
* outlines: 0.0.41

Context for the issue:

Gemma model on Hugging Face: google/gemma-1.1-2b-it
Hermes model on Hugging Face: solidrust/Hermes-2-Pro-Llama-3-8B-AWQ

May 19 '24 20:05 hugocool

I've run into a perplexing behavior while working with the solidrust/Hermes-2-Pro-Llama-3-8B-AWQ model using the outlines library. Loading the model directly from the Hugging Face Hub produces a CUDA out-of-memory (OOM) error, whereas loading the same model from disk previously produced a syntax error. The observations and the code used are below.

Observation

When loading the solidrust/Hermes-2-Pro-Llama-3-8B-AWQ model directly from Hugging Face Hub:

OOM Error:

OutOfMemoryError: CUDA out of memory. Tried to allocate 28.00 MiB. GPU 0 has a total capacity of 21.99 GiB of which 21.00 MiB is free. Process 25933 has 21.95 GiB memory in use. Of the allocated memory 20.13 GiB is allocated by PyTorch, and 373.51 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Syntax Error when loading from disk:

NoMatch: Can not match at index 4072. Got ')?([ ', expected any of ['*', '+', '?', '{', '(', '[', '\\', '.', '$', '^', "<Any 1 except ('\\\\', '$', '|', '?', '+', '.', '^', ')', '[', '(', '*')>", '|'].
Context(data[-10:+10]): ']?\\]|null))?([ \n\t]?,'

Steps/code to reproduce the OOM error:

from vllm import LLM
from vllm.sampling_params import SamplingParams
from outlines import generate
from outlines.models.vllm import VLLM
import pandas as pd

# Initialize the model
model_name = "solidrust/Hermes-2-Pro-Llama-3-8B-AWQ"

llm = LLM(
    model=model_name,
    tokenizer=model_name,
    trust_remote_code=True,
    dtype="auto",
    quantization="awq"
)

# Set the sampling parameters for the model
samplimg_params = SamplingParams(max_tokens=2048)

# Wrap the model with VLLM
model = VLLM(llm)

# Prepare the prompts
prompts = [extract_job_description_summary(job['title'], job['description']) for job in df[['title', 'description']].to_dict(orient='records')]

# Generate JSON
generator = generate.json(model, JobDescriptionSummary, whitespace_pattern="[ \n\t]?")
results = generator(prompts, sampling_params=samplimg_params)

# Extract the results into a DataFrame
data = [result.model_dump() for result in results]
extracted_texts_df = pd.DataFrame(data)

Result:

The OOM error occurs during the model loading phase when loading directly from the Hugging Face Hub, whereas a syntax error occurs when loading the model from disk.

May 19 '24 23:05 hugocool

Okay, after a lot of testing, the issue boils down to the Tuple constraint in the BaseModel.

import json
from typing import Optional, Tuple

import interegular
from pydantic import BaseModel, Field

from outlines.fsm.json_schema import build_regex_from_schema


class JobDescriptionSummary(BaseModel):
    # A single Tuple field is enough to trigger the error.
    salary_range: Optional[Tuple[int]] = Field(
        default=None,
        description="Salary range for the job, represented as a tuple of (min_salary, max_salary) in integers.",
    )


# Convert the Pydantic JSON schema to a regex, then parse it with interegular.
regex_str = build_regex_from_schema(json.dumps(JobDescriptionSummary.model_json_schema()))
regex_pattern = interegular.parse_pattern(regex_str)

returns:

NoMatch: Can not match at index 4072. Got ')?([\\', expected any of ['*', '+', '?', '{', '(', '[', '\\', '.', '$', '^', "<Any 1 except ('(', '\\\\', '*', '^', '+', ')', '$', '[', '.', '|', '?')>", '|'].
Context(data[-10:+10]): ']*\\]|null))?([\\n ]*,'
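One way to narrow down where a generated pattern breaks, assuming the failure is something Python's own re module also rejects (such as an unbalanced group), is to compile it with re and inspect the reported position. This is only a sketch: locate_regex_error is a hypothetical helper, and re and interegular do not accept exactly the same syntax, so a clean result here does not guarantee that interegular will parse the pattern.

```python
import re


def locate_regex_error(pattern, context=10):
    """Return (position, surrounding text) of the first error re reports, or None."""
    try:
        re.compile(pattern)
    except re.error as exc:
        pos = exc.pos if exc.pos is not None else 0
        start = max(0, pos - context)
        return pos, pattern[start:pos + context]
    return None


# Hypothetical pattern with one closing parenthesis too many, loosely
# modelled on the context shown in the error message above.
broken = r"(\[[0-9]*\]|null))?([\n ]*,)"
fixed = r"(\[[0-9]*\]|null)?([\n ]*,)"

print(locate_regex_error(broken))
print(locate_regex_error(fixed))
```

Running the same check on the string returned by build_regex_from_schema would show whether the pattern is malformed for re as well, or only for interegular.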

May 20 '24 12:05 hugocool

Pinging @lapp0

May 22 '24 08:05 rlouf

Thanks for the great reproduction scripts and isolation of the problem @hugocool!

You can verify the fix with

pip install "git+https://github.com/lapp0/outlines@fix-905"

Issue details

The issue is that Tuple, which Pydantic represents in JSON Schema with prefixItems, is not handled when the schema is converted to a regex.

https://docs.pydantic.dev/latest/api/json_schema/#pydantic.json_schema.GenerateJsonSchema.tuple_schema

Your json schema:

{
    'properties': {
        'salary_range': {
            'anyOf': [
                {'maxItems': 1, 'minItems': 1, 'prefixItems': [{'type': 'integer'}], 'type': 'array'},
                {'type': 'null'},
            ],
            'default': None,
            'description': 'Salary range for the job, represented as a tuple of (min_salary, max_salary) in integers.',
            'title': 'Salary Range',
        }
    },
    'title': 'JobDescriptionSummary',
    'type': 'object',
}
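To illustrate what handling prefixItems involves, here is a minimal sketch that turns the anyOf branch of the schema above into a regex. It is not the code from the fix branch: prefix_items_to_regex and any_of_to_regex are hypothetical helpers, only integer items and null are covered, minItems and maxItems are ignored, and the whitespace pattern is the one from the original report.

```python
import re

INTEGER = r"(-)?(0|[1-9][0-9]*)"  # JSON integer without leading zeros
WHITESPACE = r"[ \n\t]?"          # whitespace pattern used in the report


def prefix_items_to_regex(schema):
    """Regex for a fixed-length array whose items are listed in prefixItems."""
    item_patterns = []
    for item in schema["prefixItems"]:
        if item.get("type") != "integer":
            raise NotImplementedError(f"unsupported item schema: {item}")
        item_patterns.append(f"({INTEGER})")
    # One pattern per position, separated by commas with optional whitespace.
    separator = f"{WHITESPACE},{WHITESPACE}"
    body = separator.join(item_patterns)
    return rf"\[{WHITESPACE}{body}{WHITESPACE}\]"


def any_of_to_regex(branches):
    """Alternation over the anyOf branches (null or a prefixItems array)."""
    alternatives = []
    for branch in branches:
        if branch.get("type") == "null":
            alternatives.append("null")
        elif "prefixItems" in branch:
            alternatives.append(prefix_items_to_regex(branch))
        else:
            raise NotImplementedError(f"unsupported branch: {branch}")
    return "(" + "|".join(alternatives) + ")"


salary_range_schema = {
    "anyOf": [
        {"maxItems": 1, "minItems": 1, "prefixItems": [{"type": "integer"}], "type": "array"},
        {"type": "null"},
    ]
}
pattern = re.compile(any_of_to_regex(salary_range_schema["anyOf"]))
print(bool(pattern.fullmatch("[1080]")), bool(pattern.fullmatch("null")))
```

Because each position in prefixItems gets its own sub-pattern, the resulting regex accepts exactly as many elements as the tuple declares, which is the behaviour a Tuple annotation asks for.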

Smoke test:

from typing import Optional, Tuple

import outlines
from pydantic import BaseModel, Field


class JobDescriptionSummary(BaseModel):
    salary_range: Optional[Tuple[int]] = Field(
        default=None,
        description="Salary range for the job, represented as a tuple of (min_salary, max_salary) in integers.",
    )


model = outlines.models.transformers("microsoft/phi-2")

generator = outlines.generate.json(model, JobDescriptionSummary, whitespace_pattern="")
job = generator("Give me a job description in json format:\n")
print(repr(job))

Output:

JobDescriptionSummary(salary_range=None)

JobDescriptionSummary(salary_range=(1080,))

May 22 '24 22:05 lapp0