
[Bug]: (eval) Instance results with llm proxy `OpenAIException` errors got merged into output.jsonl

Open ryanhoangt opened this issue 1 year ago • 5 comments

Is there an existing issue for the same bug?

  • [X] I have checked the troubleshooting document at https://docs.all-hands.dev/modules/usage/troubleshooting
  • [X] I have checked the existing issues.

Describe the bug

When running the eval via All Hands AI's LLM proxy, the server sometimes crashes with a 502 response. The eval result is still collected into the output.jsonl file, with the error field being:

"error": "There was an unexpected error while running the agent: litellm.APIError: APIError: OpenAIException - <html><head>\n<meta http-equiv=\"content-type\" content=\"text/html;charset=utf-8\">\n<title>502 Server Error</title>\n</head>\n<body text=#000000 bgcolor=#ffffff>\n<h1>Error: Server Error</h1>\n<h2>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.</h2>\n<h2></h2>\n</body></html>",

We then have to manually filter out the instances with that error and rerun them. Maybe we should have some logic to automatically retry in this scenario.
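For illustration, the kind of retry I have in mind would look something like the sketch below (a minimal example, not the actual OpenHands code; the model name, attempt count, and wait time are placeholders):

import time

import litellm

def completion_with_retry(messages, model="openai/claude-3-5-sonnet@20240620",
                          max_attempts=3, wait_seconds=30):
    """Retry transient proxy failures (e.g. 502s surfaced as litellm.APIError)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return litellm.completion(model=model, messages=messages)
        except litellm.APIError:
            # The proxy's 502 comes back wrapped in litellm.APIError; give the
            # proxy the ~30 seconds it asks for before trying again.
            if attempt == max_attempts:
                raise
            time.sleep(wait_seconds)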

Current OpenHands version

0.9.7

Installation and Configuration

ALLHANDS_API_KEY="<all hands ai remote runtime key>" RUNTIME=remote SANDBOX_REMOTE_RUNTIME_API_URL="https://runtime.eval.all-hands.dev" EVAL_DOCKER_IMAGE_PREFIX="us-central1-docker.pkg.dev/evaluation-092424/swe-bench-images" ./evaluation/swe_bench/scripts/run_infer.sh llm.eval_sonnet_3_5 HEAD CoActPlannerAgent 100 40 1 "princeton-nlp/SWE-bench_Lite" test

Model and Agent

  • Model: openai/claude-3-5-sonnet@20240620
  • Agent: CoActPlannerAgent

Operating System

Linux

Reproduction Steps

No response

Logs, Errors, Screenshots, and Additional Context

No response

ryanhoangt avatar Oct 02 '24 06:10 ryanhoangt

@ryanhoangt Can you please post a traceback from the logs, if you have any by chance, or the .jsonl? I made a quick fix in the linked PR; I'd like to look at it some more, though.

enyst avatar Oct 02 '24 10:10 enyst

Unfortunately, there's no traceback in the trajectory in the jsonl file. There's only one last entry in the history field, besides the error field above. I can try capturing the traceback (if there is any) directly from the log next time.

{
    "id": 84,
    "timestamp": "2024-10-02T10:06:45.050451",
    "source": "agent",
    "message": "There was an unexpected error while running the agent",
    "observation": "error",
    "content": "There was an unexpected error while running the agent",
    "extras": {}
}

I'm also quite confused about whether it is litellm.APIError or OpenAIException. From the docs, it seems to me that OpenAIException is a provider-specific exception and litellm.APIError is a wrapper for all providers.
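For what it's worth, my understanding (which could be wrong) is that litellm maps provider errors into its own exception classes, so catching litellm.APIError should also cover the OpenAI-flavored one. A quick way to check would be something like this (attribute names like llm_provider and status_code are guesses from the litellm docs, hence the getattr):

import litellm

try:
    litellm.completion(
        model="openai/claude-3-5-sonnet@20240620",
        messages=[{"role": "user", "content": "ping"}],
    )
except litellm.APIError as e:
    # Show the concrete exception class plus provider/status info, if present.
    print(type(e).__name__,
          getattr(e, "llm_provider", None),
          getattr(e, "status_code", None))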

ryanhoangt avatar Oct 02 '24 11:10 ryanhoangt

The linked PR added retries in our LLM class, but I think a better fix would retry the eval, or make sure the failed instance isn't written to the jsonl so that it will be attempted again.
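For the second option, a rough sketch of what I mean (the "error" and output.jsonl names come from the snippet above; the marker string and helper are hypothetical, not harness code):

import json

def drop_errored_instances(path="output.jsonl", marker="OpenAIException - <html>"):
    """Rewrite output.jsonl without instances that failed on a proxy error,
    so the next eval run re-attempts them instead of skipping them."""
    with open(path) as f:
        records = [json.loads(line) for line in f if line.strip()]
    kept = [r for r in records if marker not in (r.get("error") or "")]
    with open(path, "w") as f:
        for r in kept:
            f.write(json.dumps(r) + "\n")
    return len(records) - len(kept)  # number of instances removed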

enyst avatar Oct 05 '24 22:10 enyst

Thanks for the fix! Btw, can you explain why retrying the whole eval is better? I'm not sure about the architectural side, but imo it may not be necessary to run again from the first step (especially when we're at the very end of the trajectory).

ryanhoangt avatar Oct 07 '24 05:10 ryanhoangt

Oh, they're not exclusive. The request is retried now, and we can configure the retry settings to make more attempts (in config.toml, for the respective llm.eval group). You may want to do that and give it as much time as you see fit... That will retry from the current state.
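For example, something along these lines in config.toml (the group name matches the run command above; the retry key names are from memory, so please double-check them against the config docs):

[llm.eval_sonnet_3_5]
model = "openai/claude-3-5-sonnet@20240620"
# Bump retries and backoff for flaky proxy 502s.
num_retries = 8
retry_min_wait = 15
retry_max_wait = 120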

But there will still be a limit, so my thinking here is simply that if the proxy continues to be unavailable at that point, the reasonable thing is probably to give up on that instance and just not save it in the jsonl, so we can rerun it. 🤔

enyst avatar Oct 07 '24 07:10 enyst

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Nov 07 '24 01:11 github-actions[bot]

I think I saw another error merged into the jsonl, but... only when it was 1 task and 1 worker. We usually use multiprocessing lately, which might be why we don't see it. Maybe.

On the other hand, we have since made more fixes and added some retrying when inference ends abnormally, before it gets to the output file, so maybe it has been fixed.

enyst avatar Nov 07 '24 02:11 enyst

Yeah, from my side I can see the retries happening after your fix. Recently, with the new LLM proxy, I don't even receive 502 errors anymore. Maybe this issue can be closed.

ryanhoangt avatar Nov 07 '24 09:11 ryanhoangt