[Bug] Require json-repair 0.54.1 to respect Python-style underscore number notation
What happened?
Issue
My LLM has started emitting numbers with underscore separators, such as '{"value": 82_461_110}'.
When ChatAdapter calls parse to handle the LM completion, the parse_value function it uses runs candidate = json_repair.loads(value), which returns {'value': 82}.
Fix in json-repair
Thanks to @mangiucugna, the author of json-repair, version 0.54.1 has been released to fix this problem. https://github.com/mangiucugna/json_repair/issues/169 https://github.com/mangiucugna/json_repair/releases/tag/v0.54.1
I think updating the dependency would help. Thanks!
Steps to reproduce
With json-repair==0.54.0:
import json_repair
text = '{"value": 82_461_110}'
fixed = json_repair.loads(text)
fixed
{'value': 82}
With json-repair==0.54.1:
import json_repair
text = '{"value": 82_461_110}'
fixed = json_repair.loads(text)
fixed
{'value': 82461110}
In dspy.adapters.utils, parse_value takes a value and an annotation.
If the annotation is a pydantic.BaseModel, execution reaches the json-repair call later in the function: candidate = json_repair.loads(value).
I think simply upgrading json-repair to 0.54.1 will fix this issue.
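Until the dependency is bumped, one possible stopgap is to strip the digit separators from the raw completion before it reaches the JSON parser. The following is a minimal sketch under stated assumptions, not part of DSPy or json-repair: strip_digit_underscores is a hypothetical helper, the regex only covers plain decimal numbers in otherwise well-formed text, and json.loads stands in for json_repair.loads so the example runs with the standard library alone.

```python
import json
import re

# Hypothetical helper, not a DSPy or json-repair API.
# The pattern matches either a whole JSON string literal (left untouched)
# or a decimal number whose digits are grouped with underscores.
_TOKEN = re.compile(
    r'"(?:\\.|[^"\\])*"'                           # a complete string literal
    r'|(?<![\w.])-?\d+(?:_\d+)+(?:\.\d+)?(?!\w)'   # e.g. 82_461_110 or -1_000.5
)


def strip_digit_underscores(text: str) -> str:
    """Remove underscore separators from numbers that sit outside string literals."""

    def fix(match: re.Match) -> str:
        token = match.group(0)
        # String literals are returned unchanged; only bare numbers are rewritten.
        return token if token.startswith('"') else token.replace("_", "")

    return _TOKEN.sub(fix, text)


cleaned = strip_digit_underscores('{"id": "a_1_2", "value": 82_461_110}')
parsed = json.loads(cleaned)  # json.loads stands in for json_repair.loads here
```

Matching string literals first keeps underscores inside quoted text (such as the "id" value) intact, which a plain search-and-replace on "_" would not.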
DSPy version
3.0.3
@BoluoZz Thanks for reporting the issue!
The json-repair behavior change makes sense to me, but could you share reproducible code (including your module and LM)? We did see this behavior on string fields and made corresponding fixes, but we haven't seen it on fields of pydantic types so far.
Sorry, I am new to GitHub, so please let me know if you need any more details.
The LM returns its response as a string (whether it is expressing a list, a dict, or anything else), but the annotation is a pydantic.BaseModel, so none of the earlier checks in parse_value apply: if annotation is str:, if isinstance(annotation, enum.EnumMeta):, origin = get_origin(annotation) followed by if origin is Literal:, and if not isinstance(value, str):. The code then reaches candidate = json_repair.loads(value).
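The order of checks described above can be sketched as follows. This is a simplified paraphrase of the conditions listed in this comment, not the actual body of dspy.adapters.utils.parse_value; which_branch and ShareholderStub are hypothetical names used only for illustration, and the stub is a plain class standing in for a pydantic model.

```python
import enum
from typing import Any, Literal, get_origin


class ShareholderStub:
    """Plain class standing in for a pydantic.BaseModel subclass (illustration only)."""


def which_branch(value: Any, annotation: Any) -> str:
    """Name the check that would handle the value, following the order listed above."""
    if annotation is str:
        return "str"
    if isinstance(annotation, enum.EnumMeta):
        return "enum"
    if get_origin(annotation) is Literal:
        return "literal"
    if not isinstance(value, str):
        return "non-string value"
    # A string value with a model-like annotation skips every check above.
    return "json_repair.loads"


branch = which_branch('{"value": 82_461_110}', ShareholderStub)
```

With a string completion and a model annotation, every earlier check is skipped, so the raw text goes straight to the JSON repair step.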
My code
# Number is another pydantic.BaseModel class I use to normalize numbers (in different languages, formats, and units)
# Number requires two fields: value and unit
from enum import Enum
from typing import Any

from pydantic import BaseModel, field_validator


class NumbersUnit(Enum):
    PERCENT = (0.01, '%', '%', '%')
    ONE = (1, '', '', '')
    K = (1000, '千', 'thousand', 'K')
    M = (1000000, '百万', 'million', 'M')
    B = (1000000000, '亿', 'billion', 'B')

    def __init__(self, value, ch, en, en_short):
        self.unit = value
        self.ch = ch
        self.en = en
        self.en_short = en_short
class Number(BaseModel):
    """
    value: a json number object if using LLM
    unit: any value from NumbersUnit Enum is acceptable
        '%' for percent unit
        '' for one unit
        one of '千', 'thousand', 'K' for thousands unit
        one of '百万', 'million', 'M' for millions unit
        one of '亿', 'billion', 'B' for billions unit
    """
    value: float
    unit: NumbersUnit

    @field_validator('value', mode='before')
    @classmethod
    def convert_underscore_number(cls, v):
        # float() accepts underscore separators in strings, e.g. '31_862_625'
        if isinstance(v, (str, int, float)):
            try:
                return float(v)
            except Exception:
                raise ValueError(f'Invalid value: {v}, cannot be converted to float')
        else:
            raise ValueError(f'Invalid value: {v}')

    @field_validator('unit', mode='before')
    @classmethod
    def pre_process_unit(cls, v):
        if isinstance(v, NumbersUnit):
            return v
        # List/Tuple input, match Enum value
        if isinstance(v, (list, tuple)):
            for member in NumbersUnit:
                if member.value == tuple(v):
                    return member
        # String input: try name, en_short, ch, en (case-insensitive)
        if isinstance(v, str):
            val = v.strip()
            for member in NumbersUnit:
                if (
                    member.name.lower() == val.lower() or
                    member.en_short.lower() == val.lower() or
                    member.ch == val or
                    member.en.lower() == val.lower()
                ):
                    return member
        raise ValueError(f'Invalid unit: {v}')

    def model_dump(self, **kwargs) -> dict[str, Any]:
        data = super().model_dump(**kwargs)
        # Replace value and unit with a single value scaled by the unit's multiplier
        data.pop('value')
        data.pop('unit')
        data['value'] = self.value * self.unit.unit
        return data
import pydantic
import dspy


# Shareholder2 is the model I pass to dspy for the summary
# You can ignore the entity and company names because they may be in Chinese; only the Number fields (shares and the two ratios) matter
class Shareholder2(pydantic.BaseModel):
    actual_control_entity: str
    intermediate_company: str | None
    shares: Number | None
    shareholding_ratio: Number | None
    voting_rights_ratio: Number | None

    def model_dump(self, **kwargs) -> dict[str, str | float | None]:
        return {
            'actual_control_entity': self.actual_control_entity,
            'intermediate_company': self.intermediate_company,
            'shares': self.shares.model_dump()['value'] if self.shares else None,
            'shareholding_ratio': self.shareholding_ratio.model_dump()['value'] if self.shareholding_ratio else None,
            'voting_rights_ratio': self.voting_rights_ratio.model_dump()['value'] if self.voting_rights_ratio else None,
        }


# llm_models and the ShareholdersSummary2 signature are not included in this snippet
def extract_shareholders2(model: str, text: str):
    model_params = llm_models[model]
    llm = dspy.LM(model=model, api_key=model_params['api_key'])
    with dspy.context(lm=llm):
        extract = dspy.Predict(ShareholdersSummary2)
        return extract(text=text)
LM model I use
gpt-4.1
Text for the LLM to summarize
Debug Message from Phoenix Trace
LM.call
# message returned from LM.__call__
[[ ## shareholders ## ]]
[
  {
    "actual_control_entity": "王文彬先生、王文礼先生、陈雅静女士、吴小宁女士、王文超先生及王小萍女士一致行动集团",
    "intermediate_company": "无(直接持有)",
    "shares": {
      "value": 31_862_625,
      "unit": ""
    },
    "shareholding_ratio": {
      "value": 37.49,
      "unit": "%"
    },
    "voting_rights_ratio": {
      "value": 37.49,
      "unit": "%"
    }
  },
  {
    "actual_control_entity": "天津长峰",
    "intermediate_company": "无(直接持有)",
    "shares": {
      "value": 5_220_000,
      "unit": ""
    },
    "shareholding_ratio": {
      "value": 6.14,
      "unit": "%"
    },
    "voting_rights_ratio": {
      "value": 6.14,
      "unit": "%"
    }
  }
]
[[ ## completed ## ]]
ChatAdapter.call
# message returned from ChatAdapter.__call__
[
  {
    "shareholders": [
      "actual_control_entity='王文彬先生、王文礼先生、陈雅静女士、吴小宁女士、王文超先生及王小萍女士一致行动集团' intermediate_company='无(直接持有)' shares=Number(value=31.0, unit=<NumbersUnit.ONE: (1, '', '', '')>) shareholding_ratio=Number(value=37.49, unit=<NumbersUnit.PERCENT: (0.01, '%', '%', '%')>) voting_rights_ratio=Number(value=37.49, unit=<NumbersUnit.PERCENT: (0.01, '%', '%', '%')>)",
      "actual_control_entity='天津长峰' intermediate_company='无(直接持有)' shares=Number(value=5.0, unit=<NumbersUnit.ONE: (1, '', '', '')>) shareholding_ratio=Number(value=6.14, unit=<NumbersUnit.PERCENT: (0.01, '%', '%', '%')>) voting_rights_ratio=Number(value=6.14, unit=<NumbersUnit.PERCENT: (0.01, '%', '%', '%')>)"
    ]
  }
]
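The trace shows shares already truncated to 31.0 and 5.0, which suggests the digits are lost before pydantic validation runs, so a field validator such as convert_underscore_number cannot recover them. A small standard-library check, offered as a sketch, illustrates why: Python's own numeric constructors accept underscore separators, while strict JSON does not, so the outcome depends entirely on how the parser in front of pydantic treats them. The helper name strict_json_accepts is hypothetical.

```python
import json


def strict_json_accepts(text: str) -> bool:
    """Return True if the standard-library JSON parser accepts the text."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Python's numeric constructors accept underscore separators (PEP 515),
# so the validator could have handled the full string had it received it.
full_value = float("31_862_625")

# The value that reached the validator in the trace was already cut down to 31.
truncated_value = float(31)

# Strict JSON has no underscore separators, so the raw completion is rejected
# outright, while the same number without separators parses normally.
underscore_ok = strict_json_accepts('{"value": 31_862_625}')
plain_ok = strict_json_accepts('{"value": 31862625}')
```

Because the raw completion is not valid JSON, a lenient parser has to decide what to do with the "_" characters, and that decision is what changed between the two json-repair versions compared in this report.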
My debug logic
The Phoenix trace shows which methods and functions call the LM. Only ChatAdapter was used, not JSONAdapter, so the call must have followed this path:
class ChatAdapter(Adapter):
    def __call__(
        self,
        lm: LM,
        lm_kwargs: dict[str, Any],
        signature: type[Signature],
        demos: list[dict[str, Any]],
        inputs: dict[str, Any],
    ) -> list[dict[str, Any]]:
        try:
            # Execution must return from here; otherwise the JSONAdapter fallback below would run
            return super().__call__(lm, lm_kwargs, signature, demos, inputs)
        except Exception as e:
            # call json adapter
            ...  # fallback code elided in this excerpt


# Base-class method reached through super().__call__ above (excerpt)
class Adapter:
    def __call__(
        self,
        lm: "LM",
        lm_kwargs: dict[str, Any],
        signature: type[Signature],
        demos: list[dict[str, Any]],
        inputs: dict[str, Any],
    ) -> list[dict[str, Any]]:
        processed_signature = self._call_preprocess(lm, lm_kwargs, signature, inputs)
        inputs = self.format(processed_signature, demos, inputs)
        # outputs holds the value returned from LM.__call__ shown above
        outputs = lm(messages=inputs, **lm_kwargs)
        # this should return the value shown above for ChatAdapter.__call__
        return self._call_postprocess(processed_signature, signature, outputs)
Because LLM output is random, not every call produces numbers with underscores. So I think you can only reproduce the issue partway through the code path, using the message I collected from the Phoenix trace.
I tried instructing the model not to write underscores, but that didn't work well.
If you need more code, please let me know.
@BoluoZz Thanks for the detailed information! Let me do some testing and update the deps accordingly.