`json` tool is inventing additional fields
The bug
What is the scope of guidance.json tool? I have tried it with very simple pydantic models and it works for them, but I get anomalous behavior when I nest models.
- How do I avoid generating extra fields?
- Is pydantic Field(..., description=...) accounted for during generation? If it doesn't it would be so cool if it did among other things like knowing the correct name of the schema (following example)
There was one instance where I used the following code -
class ExampleToolArgs(BaseModel):
model_config = ConfigDict(extra="ignore")
a: int = Field(..., description="The 1st argument")
b: int = Field(..., description="The 2nd argument")
c: str = Field(..., description="The 3rd argument")
class DictStrIntArg(BaseModel):
a: int = Field(..., description="The 1st argument")
b: int = Field(..., description="The 2nd argument")
class PascalToolArgs(BaseModel):
model_config = ConfigDict(extra="ignore")
a: int = Field(..., description="The 1st argument")
b: int = Field(..., description="The 2nd argument")
c: DictStrIntArg = Field(
..., description="Dictionary with 2 keys of any name and any integer value"
)
d: str = Field(..., description="The 4th argument")
class ToolCall(BaseModel):
model_config = ConfigDict(extra="ignore")
tool_name: str = Field(..., description="The name of the tool to call")
tool_args: ExampleToolArgs | PascalToolArgs = Field(
..., description="The arguments to the tool"
)
class ToolUsage(BaseModel):
tool_calls: list[ToolCall] = Field(..., description="The calls to the tools")
lmx = lm.copy()
with g.user():
lmx += "Please generate 3 tool calls to the pascal tool and 2 to example tool. DO NOT generate ANY argument that is not specified in the schema."
with g.assistant():
lmx += json("calls", schema=ToolUsage, temperature=0.2)
pprint(j.loads(lmx["calls"]))
and got this,
It never seems to understand the dictionary part for Pascaltool and moreover forgets about the parameter 'd'.
To Reproduce ⬆️
System info (please complete the following information):
- OS (e.g. Ubuntu, Windows 11, Mac OS, etc.): WSL Ubuntu
- Guidance Version (
guidance.__version__): 0.2.0
Great call @adityaprakash-work. We've taken a stance so far of sticking very strictly to the JSON schema spec, which allows for additional fields unless you explicitly specify "additionalProperties": false in the schema defintion.
You can change this (I think, can't test atm) by creating a ConfigDict and setting extra="forbid":
from pydantic import BaseModel, ConfigDict
class MyModel(BaseModel):
model_config = ConfigDict(extra="forbid") # Prevent extra fields
name: str
age: int
# Generate JSON Schema
json_schema = MyModel.model_json_schema()
import json
print(json.dumps(json_schema, indent=2))
That said... I think this is bad guidance library behavior, because most users don't know that JSON schema by default allows extra fields. This is a good push for us to discuss changing our default behavior.
Tagging @hudson-ai and @julian for further thoughts :) I think the field descriptions are another good area to discuss...we really should think about passing them by default. I wonder if we can have a kwarg on the json function to talk about opinionated vs. strict adherence to json schema semantics, defaulted to what is most natural for AI users
(Maybe surprisingly, I'm not a pydantic expert! so I'd have to look closer to have a real opinion on what seems right) -- but the one thing I can say I'm aware of immediately is that I think pydantic is still working out what it wants its own defaults to be for some cases here -- specifically I've seen pydantic/pydantic#10785 cross my eye before. I think they asked me to offer an opinion somewhere on this, but I can't find it at the minute...
model_config = ConfigDict(extra="forbid") works 👍 (minor issue, but extra="forbid" did not prevent extra keys for Mock model).
Also, is the supplied schema necessarily respected?
c wasn't supposed to be empty I guess.