Testing judge often fails to parse arguments
Bug Description
When using the testing assertions with judge, we have been seeing more and more failures like this:
services/agent_worker/src/agent/testing/run_story.py:501: in _assert_event
await chat_assert.judge(llm_judge, intent=story_event.judge)
services/agent_worker/.venv/lib/python3.12/site-packages/opentelemetry/util/_decorator.py:71: in async_wrapper
return await func(*args, **kwargs) # type: ignore
^^^^^^^^^^^^^^^^^^^^^^^^^^^
services/agent_worker/.venv/lib/python3.12/site-packages/livekit/agents/llm/utils.py:361: in prepare_function_arguments
args_dict = from_json(json_arguments)
^^^^^^^^^^^^^^^^^^^^^^^^^
E ValueError: EOF while parsing a string at line 1 column 66
This seems to be caused by the judge returning incomplete function arguments (the JSON ends in the middle of a string), which livekit then fails to parse.
We are not sure whether we can do anything about this on our side, or whether it needs specific judge prompt tuning.
We did not see this issue about two months ago; it now accounts for roughly 10% of our failure cases (a rough estimate, not a measurement).
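As a stopgap while the root cause is unclear, one option might be to retry the judge call when argument parsing raises ValueError. The sketch below is only an illustration: retry_on_value_error and make_flaky_judge are hypothetical helpers, not part of livekit-agents, and the fake judge parses truncated JSON with the standard json module to mimic the failure in the traceback. In our test the wrapped call would presumably be something like lambda: chat_assert.judge(llm_judge, intent=story_event.judge); whether calling judge repeatedly on the same assertion is safe is an assumption we have not verified.

```python
import asyncio
import json
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_on_value_error(
    call: Callable[[], Awaitable[T]],
    attempts: int = 3,
) -> T:
    """Await call() up to `attempts` times, retrying only on ValueError.

    Hypothetical helper, not a livekit-agents API.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return await call()
        except ValueError:
            # Re-raise once the retry budget is used up so the test still fails loudly.
            if attempt == attempts:
                raise
    raise AssertionError("unreachable")


def make_flaky_judge(failures: int) -> Callable[[], Awaitable[dict]]:
    """Stand-in for the judge: the first `failures` calls see truncated JSON."""
    state = {"calls": 0}

    async def judge() -> dict:
        state["calls"] += 1
        raw = '{"success": true, "reason": "Asks reason to transfer"}'
        if state["calls"] <= failures:
            # Cut the payload off mid-string, similar to what the traceback suggests.
            raw = raw[:30]
        # json.JSONDecodeError is a subclass of ValueError.
        return json.loads(raw)

    return judge


if __name__ == "__main__":
    verdict = asyncio.run(retry_on_value_error(make_flaky_judge(failures=1)))
    print(verdict)
```

A retry like this only hides the flakiness; it would not explain why the judge started producing incomplete arguments in the first place.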
Expected Behavior
The judge should not fail because of malformed arguments. We know LLMs can be tricky, though.
Reproduction Steps
Hard to reproduce reliably. It can happen in any case where the judge is used, across different intents.
We usually phrase intents like this (examples that lead to this failure):
* "Asks reason to transfer"
* "Lets the user know about the one open account and offers to make a payment."
Operating System
Linux (github actions)
Models Used
No response
Package Versions
❯ uv pip list
Package Version Editable project location
---------------------------------------- --------------- -------------------------------------------------------------------------
aiofiles 24.1.0
aiohappyeyeballs 2.6.1
aiohttp 3.12.15
aiohttp-retry 2.9.1
aiosignal 1.4.0
annotated-types 0.7.0
anyio 4.10.0
attrs 25.3.0
av 15.1.0
aws-msk-iam-sasl-signer-python 1.0.2
boto3 1.40.21
botocore 1.40.21
bytecode 0.16.2
certifi 2025.8.3
cffi 1.17.1
charset-normalizer 3.4.3
click 8.2.1
colorama 0.4.6
coloredlogs 15.0.1
datadog 0.52.1
dateparser 1.2.2
ddtrace 3.17.0
distro 1.9.0
docstring-parser 0.17.0
envier 0.6.1
eval-type-backport 0.2.2
execnet 2.1.1
filelock 3.19.1
flatbuffers 25.2.10
frozenlist 1.7.0
fsspec 2025.7.0
googleapis-common-protos 1.70.0
grpcio 1.74.0
h11 0.16.0
hf-xet 1.1.9
httpcore 1.0.9
httpx 0.28.1
huggingface-hub 0.34.4
humanfriendly 10.0
idna 3.10
importlib-metadata 8.7.0
iniconfig 2.1.0
jellyfish 1.2.1
jinja2 3.1.6
jiter 0.10.0
jmespath 1.0.1
kafka-python 2.2.15
livekit 1.0.19
livekit-agents 1.3.5
livekit-api 1.0.7
livekit-blingfire 1.0.0
livekit-plugins-cartesia 1.3.5
livekit-plugins-deepgram 1.3.5
livekit-plugins-noise-cancellation 0.2.5
livekit-plugins-openai 1.3.5
livekit-plugins-silero 1.3.5
livekit-plugins-turn-detector 1.3.5
livekit-protocol 1.1.0
lz4 4.4.4
markdown-it-py 4.0.0
markupsafe 3.0.2
mdurl 0.1.2
mpmath 1.3.0
multidict 6.6.4
nest-asyncio 1.6.0
networkx 3.5
nicknames 1.0.0
nodeenv 1.9.1
numpy 2.3.2
onnxruntime 1.22.1
openai 2.7.2
opentelemetry-api 1.36.0
opentelemetry-exporter-otlp 1.36.0
opentelemetry-exporter-otlp-proto-common 1.36.0
opentelemetry-exporter-otlp-proto-grpc 1.36.0
opentelemetry-exporter-otlp-proto-http 1.36.0
opentelemetry-proto 1.36.0
opentelemetry-sdk 1.36.0
opentelemetry-semantic-conventions 0.57b0
packaging 25.0
pillow 11.3.0
pluggy 1.6.0
prometheus-client 0.22.1
propcache 0.3.2
protobuf 6.32.0
psutil 7.0.0
pycparser 2.22
pydantic 2.11.7
pydantic-core 2.33.2
pydantic-settings 2.10.1
pygments 2.19.2
pyjwt 2.10.1
pyright 1.1.404
pytest 8.4.1
pytest-asyncio 1.1.0
pytest-mock 3.14.1
pytest-xdist 3.8.0
python-dateutil 2.9.0.post0
python-dotenv 1.1.1
python-json-logger 3.3.0
pytz 2025.2
pyyaml 6.0.2
regex 2025.8.29
requests 2.32.5
rich 14.2.0
s3transfer 0.13.1
safetensors 0.6.2
setuptools 80.9.0
shellingham 1.5.4
six 1.17.0
sniffio 1.3.1
sounddevice 0.5.2
sympy 1.14.0
tokenizers 0.22.0
torch 2.9.1
tqdm 4.67.1
transformers 4.56.0
typer 0.20.0
types-protobuf 4.25.0.20240417
typing-extensions 4.15.0
typing-inspection 0.4.1
tzlocal 5.3.1
urllib3 2.5.0
watchfiles 1.1.0
websockets 15.0.1
wrapt 1.17.3
yarl 1.20.1
zipp 3.23.0
Session/Room/Call IDs
No response
Proposed Solution
Additional Context
No response
Screenshots and Recordings
No response