jsonformer

Add integer and enum types

Open Ryul0rd opened this issue 1 year ago • 4 comments

This PR adds 2 new types and enforces their correct generation:

  • Integers: These are treated as JSON numbers but never contain a "." character, so they can be parsed as ints. (Regular numbers currently almost always contain a "." character. Is this intended?)
  • Enums: These are treated as JSON strings but will only ever be one of the options specified in the schema.

Example schema:

car = {
    "type": "object",
    "properties": {
        "make": {"type": "string"},
        "model": {"type": "string"},
        "year": {"type": "integer"},
        "color": {
            "type": "enum",
            "values": ["red", "green", "blue", "brown", "white", "black"],
        },
    },
}
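
As a rough illustration of what "enforcing" these two types means for the output, here is a minimal sketch that checks a parsed result against the schema above. The `conforms` helper and the sample JSON strings are hypothetical and are not part of jsonformer; the sketch only assumes the `"integer"` and `"enum"`/`"values"` conventions shown in the example schema.

```python
import json

# Schema from the example above.
car = {
    "type": "object",
    "properties": {
        "make": {"type": "string"},
        "model": {"type": "string"},
        "year": {"type": "integer"},
        "color": {
            "type": "enum",
            "values": ["red", "green", "blue", "brown", "white", "black"],
        },
    },
}


def conforms(value, schema):
    """Hypothetical checker: True if value matches the schema's type names."""
    kind = schema["type"]
    if kind == "object":
        return isinstance(value, dict) and all(
            key in value and conforms(value[key], sub)
            for key, sub in schema["properties"].items()
        )
    if kind == "string":
        return isinstance(value, str)
    if kind == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    if kind == "enum":
        return value in schema["values"]
    raise ValueError(f"unsupported type: {kind}")


# Made-up sample outputs, for illustration only.
good = json.loads('{"make": "Toyota", "model": "Corolla", "year": 1998, "color": "red"}')
has_dot = json.loads('{"make": "Toyota", "model": "Corolla", "year": 1998.0, "color": "red"}')
bad_color = json.loads('{"make": "Toyota", "model": "Corolla", "year": 1998, "color": "purple"}')

print(conforms(good, car), conforms(has_dot, car), conforms(bad_color, car))
```

A year written as `1998.0` parses to a float and fails the integer check, which is the distinction between the integer type and regular numbers described above.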

I also went ahead and fixed the issue described in the TODO in the generate bool method, since it was an easy fix.

BTW, the performance is much better than when I last checked up on this library, so great job on that!

Ryul0rd avatar May 15 '23 07:05 Ryul0rd

Everything seems to be working now. One of the issues turned out to affect both integer generation and number generation, so that bug is fixed as well. The model wasn't being allowed to generate a comma, which is the correct way to terminate a JSON number. As a result, numbers would just run to the max allowed length in most cases.
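
To make the comma issue concrete, here is a toy sketch of masked number generation. It is not jsonformer's implementation: `generate_number` and `toy_pick` are hypothetical names, and `toy_pick` is a scripted stand-in for a language model that "wants" to write `1998` followed by a comma.

```python
from typing import Callable, Set

DIGITS = set("0123456789")


def generate_number(
    pick: Callable[[str, Set[str]], str],
    max_len: int = 8,
    allow_comma: bool = True,
    integer: bool = False,
) -> str:
    """Build a number one character at a time from a masked set of tokens."""
    allowed = set(DIGITS)
    if not integer:
        allowed.add(".")  # regular numbers may contain a decimal point
    if allow_comma:
        allowed.add(",")  # the comma is what lets the model end the number
    text = ""
    while len(text) < max_len:
        token = pick(text, allowed)
        if token == ",":
            break  # terminator reached: stop without appending it
        text += token
    return text


def toy_pick(prefix: str, allowed: Set[str]) -> str:
    """Scripted stand-in for a model: prefers '1998' and then a comma."""
    wanted = "1998,"
    preferred = wanted[len(prefix)] if len(prefix) < len(wanted) else ","
    if preferred in allowed:
        return preferred
    return "0"  # forced fallback when the preferred token is masked out


with_comma = generate_number(toy_pick)
without_comma = generate_number(toy_pick, allow_comma=False)
print(with_comma, without_comma)
```

When the comma is masked out, the stand-in model has no way to stop and pads the number until `max_len` is reached, which mirrors the behaviour described above.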

Ryul0rd avatar May 18 '23 09:05 Ryul0rd

Hey, I wanted to ask about the performance issues and see if there is any way I can help. I am running this:

state = {
    "type": "object",
    "properties": {
        "state": {
            "type": "enum",
            "values": ["CA", "WA", "VA", "PA", "NY"],
        },
    },
}

builder = Jsonformer(
    model=model,
    tokenizer=tokenizer,
    json_schema=state,
    prompt="Please generate a JSON for the state PA: ",
    max_string_token_length=20,
)

print("Generating...")
output = builder()

highlight_values(output)

And getting this result:

Generating... { state: "CA" }

This is happening when I run with num_beams=10 in the generate method.

It feels strange that the model is struggling on simple cases like this. Is this an issue of model performance? Or is the way we are masking tokens perhaps too restrictive? Let me know your thoughts. I have been stuck on this for a couple of days trying to implement an SSN format, and it really starts to struggle on comparable, seemingly simple tasks.

JamesHill0 avatar Jun 05 '23 17:06 JamesHill0

@JamesHill0 What model are you using? It's a bit hard to say if that's the issue without knowing what the model is. All jsonformer can do is guarantee you get valid output, and you did. num_beams also isn't going to do anything here because we already treat each option like a beam. You could try other models or another similar library like guidance or LMQL and see if either of those works better.
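
A minimal sketch of one way to read "treat each option like a beam": score every enum value and keep the best one. The `pick_enum` helper is hypothetical, and the log-probabilities are made-up numbers standing in for whatever a real model would assign; they are not measurements.

```python
from typing import Callable, Sequence


def pick_enum(options: Sequence[str], logprob: Callable[[str], float]) -> str:
    """Return the option with the highest score under the given scorer."""
    return max(options, key=logprob)


# Made-up scores for illustration only; a real model supplies these.
TOY_LOGPROBS = {"CA": -0.9, "WA": -2.3, "VA": -2.5, "PA": -1.1, "NY": -2.0}

choice = pick_enum(["CA", "WA", "VA", "PA", "NY"], TOY_LOGPROBS.__getitem__)
print(choice)
```

Under this reading every option is already evaluated in full, so the result is always a valid enum value, and which one wins depends entirely on the scores the model assigns.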

Ryul0rd avatar Jun 06 '23 03:06 Ryul0rd

I merged it in this branch, where I added probabilities too: https://github.com/wassname/prob_jsonformer

Also, I made a list of other libs here: https://github.com/wassname/awesome-interpretability/tree/main?tab=readme-ov-file#structured-output

wassname avatar May 10 '24 12:05 wassname