jsonformer
jsonformer copied to clipboard
Add integer and enum types
This PR adds 2 new types and enforces their correct generation:
- Integers: These are treated as json numbers but will never contain a "." character so can be parsed as ints. (Regular numbers will practically always have a "." character currently. Is this intended?)
- Enums: These are treated as json strings but will only ever be one of several options specified in the schema.
Example schema:
car = {
"type": "object",
"properties": {
"make": {"type": "string"},
"model": {"type": "string"},
"year": {"type": "integer"},
"color": {
"type": "enum",
"values": ["red", "green", "blue", "brown", "white", "black"],
},
},
}
I also went ahead and fixed the issue described in the todo in the generate bool method since it was an easy fix.
BTW, the performance is much better since I last checked up on this library so great job on that!
Everything seems to be working now. One of the issues was actually a problem with both integer generation and number generation so that bug is also fixed. The model wasn't actually being allowed to generate a comma, which would actually be the correct way to terminate a JSON number. The result was that the numbers would just go to the max allowed length in most cases.
Hey I wanted to ask about the performance issues and see if there is any way I can help. I am running this:
state = {
"type": "object",
"properties": {
"state": {
"type": "enum",
"values": ["CA", "WA", "VA", "PA", "NY"],
},
},
}
builder = Jsonformer(
model=model,
tokenizer=tokenizer,
json_schema=state,
prompt="Please generate a JSON for the state PA: ",
max_string_token_length=20,
)
print("Generating...")
output = builder()
highlight_values(output)
And getting this result:
Generating... { state: "CA" }
This is happening when I run with num_beams=10 in the generate method.
It feels strange that the model is struggling on simple cases like this. Is this an issue of model performance? Or perhaps something to do with the way we are masking tokens being too restrictive? Let me know your thoughts I have been stuck on this for a couple days trying to implement an SSN format and it really starts to struggle on comparable seemingly simple tasks.
@JamesHill0 What mode are you using? It's a bit hard to say if that's the issue without knowing what the model is. All jsonformer can do is guarantee you get valid output and you did. num_beams also isn't going to do anything here because we already treat each option like a beam. You could try other models or another similar library like guidance or LMQL and see if either of those work better.
I merged it in this branch, where I added probabilities too https://github.com/wassname/prob_jsonformer
also I made a list of other libs here https://github.com/wassname/awesome-interpretability/tree/main?tab=readme-ov-file#structured-output