bug: fix fn_call error during API response
- [ ] This change is worth documenting at https://docs.all-hands.dev/
- [ ] Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below
End-user friendly description of the problem this fixes or functionality this introduces.
Summarize what the PR does, explaining any non-trivial design decisions.

This is a bug I found when receiving a simulated fn-call response. Here's the original code snippet: https://github.com/All-Hands-AI/OpenHands/blob/77cd05c33b1eb292c336db772b973920f5b9daf4/openhands/llm/fn_call_converter.py#L728-L739
As you can see at line 733, the method is based on the assumption that the response of a simulated fn-call always ends up with a content list where `len(content) == 1`. However, this is not true for the provider openrouter/anthropic. Here's a response example I got:
```json
[
  {
    "type": "text",
    "text": "EXECUTION RESULT of [browser]:\n"
  },
  {
    "type": "text",
    "text": "LARGE CONTENT HERE......[Current URL: http://localhost:53782/]\n[Focused element bid: 8]\n\n[Action executed successfully.]\n============== BEGIN accessibility tree ==============\nRootWebArea 'VSCode......Note that only visible portion of webpage is present in the screenshot. You may need to scroll to view the remaining portion of the web-page.)\n"
  },
  {
    "type": "image_url",
    "image_url": {
      "url": "..LARGE IMAGE BASE64 CONTENT HERE...... data:image/png;base64,iVBORw0......kSuQmCC"
    },
    "cache_control": {
      "type": "ephemeral"
    }
  }
]
```
And of course such a response leads to a raised error. The main issue with this response is that it splits the fn-call result into two text contents (I don't know why), and `fn_call_converter` apparently didn't handle that case. So I made this change to concatenate all text content into one.
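To make the failure concrete, here is an illustrative reconstruction of the assumption (the function name and exact check are hypothetical, not the real `fn_call_converter` source): the converter expected a simulated fn-call response to carry exactly one text content part.

```python
def extract_single_text(content: list) -> str:
    """Mimics the len(content) == 1 assumption described above."""
    if len(content) != 1 or content[0].get("type") != "text":
        raise ValueError(f"Expected exactly one text part, got {len(content)}")
    return content[0]["text"]

# The openrouter/anthropic-style response splits the result into parts,
# so the single-part assumption no longer holds:
response_content = [
    {"type": "text", "text": "EXECUTION RESULT of [browser]:\n"},
    {"type": "text", "text": "...page content..."},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
]
# extract_single_text(response_content) raises ValueError here.
```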
As a result, it works well.
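The workaround is roughly the following sketch (`merge_text_parts` is an illustrative name, not the actual diff): collapse every `text` part of a multi-part content list into one string before the non-fncall-to-fncall conversion parses it.

```python
def merge_text_parts(content):
    """Concatenate all 'text' items in a message content list."""
    if isinstance(content, str):
        return content  # already a plain string, nothing to merge
    return "".join(
        part.get("text", "") for part in content if part.get("type") == "text"
    )

# A shortened version of the multi-part response from the example above:
parts = [
    {"type": "text", "text": "EXECUTION RESULT of [browser]:\n"},
    {"type": "text", "text": "...page content..."},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
]
merged = merge_text_parts(parts)
# merged == "EXECUTION RESULT of [browser]:\n...page content..."
```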
Link of any specific issues this addresses:
Not checked yet, but I'm pretty sure anyone using openrouter/anthropic as a provider would encounter it.
Thank you for the PR! This seems strange:
> The main issue with such response is that, it separates the response of fn_call into 2 contents (I don't know why), and apparently fn_call_converter didn't include such case. So I made this change to concatenate all text content into one.
The tool result is something `openhands` creates: the model's response had a tool call / function call, and `openhands` processed it and made an observation. I don't think we intended to create a result that fails our own checks. 😅
Maybe this is a recent bug, or a browser-specific thing we may have missed. 🤔
> The tool result is something `openhands` creates: the model's response had a tool call / function call, and `openhands` processed it and made an observation. I don't think we intended to create a result that fails our own checks. 😅 Maybe this is a recent bug, or a browser-specific thing we may have missed. 🤔
Then it is indeed very strange. But let me first clarify that this happens during the conversion from non-fncall messages back to fncall ones, so the model itself is not aware that it is a fncall message.
Let me straighten things out a bit, just to make sure we don't misunderstand each other; it goes more like this:
```mermaid
flowchart TD
    A[After some turns of conversation...] --> B[CodeAct agent decides there should be a function call]
    B --> C[It is calling a non-fncall-supported model]
    C --> D[Convert fncall messages into non-fncall ones]
    D --> E[Send the request to the LLM's API]
    E --> F[Receive the response from the LLM's API]
    F --> G[The response follows the simulated-fncall pattern]
    G --> H[Convert non-fncall messages back to fncall ones, so that the CodeAct agent can handle them properly]
```
I believe the problem occurs in the last step. But I'm not sure whether it's an issue inside the browser tool, or whether the LLM is to blame. The only thing I'm 100% sure of is that the response I received looks exactly as I described: the contents are split into two parts at the time the non-fncall messages are converted back to fncalls.
From what I can tell, openhands made the observation and sent the request, but failed when parsing the response, if I'm right about all this. Would you share more if I'm wrong? I'm really interested in all of this.
I think you're right, I'm just saying this shouldn't have happened in the first place... and maybe I see the issue: https://github.com/All-Hands-AI/OpenHands/blob/0c2924453f04a2348c2c6df4296dfbfe29b21739/openhands/llm/fn_call_converter.py#L539
When the tool is "browser" and, as your example shows, it has both a 'text' and an 'image_url' in the list (content is a list in this case), then our logic there is... wrong, isn't it? We check if the last thing in the list is a text type:
- if yes, concatenate the `prefix` to it (the prefix is the first thing in your example, the "EXECUTION..." part)
- if not, create a new element from the `prefix` alone and prepend it.

But in your example, the last element is an image_url => so we miss the text.
I think the logic should be instead: if content is a list, find the first text type element, concatenate prefix to it, if none, only then create a new element for the prefix. What do you think, does that make sense?
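That proposed logic could be sketched like this (illustrative names, not the actual converter code): scan for the first `text` element and attach the prefix there, and only fall back to prepending a new element when there is no text part at all.

```python
def prepend_prefix(content, prefix):
    """Attach `prefix` to the first 'text' item, or prepend a new one."""
    for part in content:
        if part.get("type") == "text":
            part["text"] = prefix + part["text"]
            return content
    # No text element at all: create one for the prefix and prepend it.
    content.insert(0, {"type": "text", "text": prefix})
    return content

# Even with a trailing image_url, the prefix now lands on the first text part:
content = [
    {"type": "text", "text": "page body\n"},
    {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
]
prepend_prefix(content, "EXECUTION RESULT of [browser]:\n")
# content[0]["text"] == "EXECUTION RESULT of [browser]:\npage body\n"
```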
> When the tool is "browser" and, as your example shows, it has both a 'text' and an 'image_url' in the list (content is a list in this case), then our logic there is... wrong, isn't it? We check if the last thing in the list is a text type.
OH! I see!! I didn't check how it becomes like that in the first place, so I only fixed how the response is parsed. You're definitely right about this: we should never create such a content and include it in the request!! I didn't even know it was an element we created there!
> I think the logic should be instead: if content is a list, find the first text type element, concatenate prefix to it, if none, only then create a new element for the prefix. What do you think, does that make sense?
Absolutely! A simulated fn-call response should be wrapped into one content; what we want here is just to make sure the fn-call starts with the PREFIX we want.
I'll commit your logic in a minute. Then you can continue reviewing or testing.
Thank you!
On a side note, if you run `make build`, it should install the pre-commit hook, so that the linting step finds and automatically fixes linting issues.
Could you perhaps add a unit test for this? We normally have these tests in `test_llm_fncall_converter.py`.
I think this is good to go now, please have a look. And I really appreciate your patient support through all of this. @enyst