json.loads fails when LLM outputs contain Python-style boolean values
🐛 Bug Description
I use the Volcengine API as the chat API backend. I found that it always fails when the LLM reaches this step. The error is something like `Expecting value: line xx column xx (char xx)`, i.e. a JSON decode error. The response that fails to parse looks like:
```json
{
    "final_decision": True,
    "final_feedback": "The factor '10D_Momentum_Normalized' is implemented correctly. The code executed successfully, the output file 'result.h5' is found, and the factor value dataframe meets all specified requirements (correct MultiIndex, column name, data type, and no infinite values). The daily data alignment and absence of errors confirm the implementation is accurate."
}
```
The cause of this error is that the response returned by Volcengine contains boolean values that begin with an uppercase letter, following Python's convention. This is not valid in standard JSON, so json.loads raises an error when parsing such responses.
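For reference, here is a minimal standalone snippet (unrelated to RD-Agent internals; the payload is just an example) that reproduces the decode error:

```python
import json

# Python-style booleans are not valid JSON literals, so parsing fails.
response = '{"final_decision": True}'
try:
    json.loads(response)
except json.JSONDecodeError as e:
    print(e)  # Expecting value: line 1 column 20 (char 19)

# With standard lowercase booleans, parsing succeeds.
print(json.loads('{"final_decision": true}'))  # {'final_decision': True}
```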
So I added a simple function to convert the boolean values in the all_response:
```python
import re

def replace_true_false(self, s: str) -> str:
    # Lowercase Python-style booleans so the string becomes valid JSON.
    def repl(match: re.Match) -> str:
        return match.group(0).lower()
    return re.sub(r"\b(True|False|TRUE|FALSE)\b", repl, s)
```
This is just a temporary workaround; I hope there is a better approach.
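For illustration, a sketch of the same workaround as a free function, applied to a made-up payload before json.loads:

```python
import json
import re

def replace_true_false(s: str) -> str:
    # Same regex as above, written as a standalone function for this demo.
    return re.sub(r"\b(True|False|TRUE|FALSE)\b", lambda m: m.group(0).lower(), s)

raw = '{"final_decision": True, "final_feedback": "looks good"}'
print(json.loads(replace_true_false(raw)))
# {'final_decision': True, 'final_feedback': 'looks good'}
```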
To Reproduce
Steps to reproduce the behavior:
rdagent fin_factor
You can change the JSON output template's `True` to `true` and `False` to `false` in the code.
> You can change the JSON output template's `True` to `true` and `False` to `false` in the code.
Yes, you are right. But I am curious why the output returned by OpenAI can be parsed by json.loads. Am I the only one who has met this issue?
If you change every "True/False" into "true/false", there will be another problem: when the LLM returns a string that contains code, like `df.reset_index(replace=True)`, the replacement turns it into wrong code and will confuse the LLM.
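To illustrate the concern (the payload below is made up), a blanket regex replacement also rewrites booleans that appear inside JSON string values and silently corrupts the embedded code:

```python
import json
import re

raw = '{"final_decision": True, "final_feedback": "run df.reset_index(replace=True)"}'
naive = re.sub(r"\b(True|False)\b", lambda m: m.group(0).lower(), raw)
print(naive)
# {"final_decision": true, "final_feedback": "run df.reset_index(replace=true)"}
# json.loads(naive) now succeeds, but the code inside the string value was changed.
print(json.loads(naive)["final_feedback"])  # run df.reset_index(replace=true)
```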
We have fixed this problem in rdagent/oai/backend/base.py via this function:
```python
import io
import json
import re
import tokenize

def _fix_python_booleans(json_str: str) -> str:
    """Safely fix Python-style booleans to JSON standard format using tokenize"""
    replacements = {"True": "true", "False": "false", "None": "null"}
    try:
        out = []
        io_string = io.StringIO(json_str)
        tokens = tokenize.generate_tokens(io_string.readline)
        # Only bare NAME tokens are rewritten, so booleans inside string literals stay intact.
        for toknum, tokval, _, _, _ in tokens:
            if toknum == tokenize.NAME and tokval in replacements:
                out.append(replacements[tokval])
            else:
                out.append(tokval)
        result = "".join(out)
        # Validate that the result is valid JSON
        json.loads(result)
        return result
    except (tokenize.TokenError, json.JSONDecodeError):
        # If tokenize fails, fall back to the regex method
        for python_val, json_val in replacements.items():
            json_str = re.sub(rf"\b{python_val}\b", json_val, json_str)
        return json_str
```
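For illustration, a quick check of the function above on a made-up payload (this assumes the imports and function defined above; whitespace between tokens is dropped by the join, which is harmless for JSON):

```python
raw = '{"ok": True, "note": "df.reset_index(replace=True)", "extra": None}'
fixed = _fix_python_booleans(raw)
print(fixed)
# {"ok":true,"note":"df.reset_index(replace=True)","extra":null}
# Bare True/False/None tokens are rewritten, but booleans inside string values are preserved.
print(json.loads(fixed)["note"])  # df.reset_index(replace=True)
```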
Hope this can help you.
Actually, the JSON error with boolean values in responses is caused by two prompt files: rdagent/components/coder/factor_coder/prompts.yaml and rdagent/components/coder/model_coder/prompts.yaml.
Some templates in those two prompt files contain True instead of true (unlike most other prompt files), which leads the LLM to output True and raises the JSON error.
After replacing True with true in the prompts, the JSON error never came up again.
That's true, but since prompt formats might vary in the future, I think checking and fixing it at runtime is a more robust and elegant solution. Thanks for pointing out the root cause in the prompts.