[BFCL] The multi-turn generation hangs and doesn’t progress.
Hi,
I have been trying to run generation for multi-turn use cases, but after many attempts it still doesn't work (I waited about 24 hours). The other tests run fine; the only issue is multi-turn generation.
It usually gets stuck around here, at what looks like roughly 1% progress:
ID: long_context_86, Turn: 0, Step: 9
----------------------------------------------------------------------------------------------------
ID: long_context_52, Turn: 0, Step: 12
Failed to decode the model response. Proceed to next turn.
----------------------------------------------------------------------------------------------------
ID: long_context_139, Turn: 1, Step: 0
----------------------------------------------------------------------------------------------------
ID: long_context_129, Turn: 0, Step: 2
Failed to decode the model response. Proceed to next turn.
----------------------------------------------------------------------------------------------------
ID: long_context_142, Turn: 0, Step: 0
----------------------------------------------------------------------------------------------------
ID: long_context_47, Turn: 1, Step: 5
----------------------------------------------------------------------------------------------------
ID: base_199, Turn: 0, Step: 10
----------------------------------------------------------------------------------------------------
ID: long_context_97, Turn: 0, Step: 6
----------------------------------------------------------------------------------------------------
ID: long_context_115, Turn: 0, Step: 6
----------------------------------------------------------------------------------------------------
ID: long_context_73, Turn: 0, Step: 11
----------------------------------------------------------------------------------------------------
ID: long_context_80, Turn: 0, Step: 9
Failed to decode the model response. Proceed to next turn.
----------------------------------------------------------------------------------------------------
ID: long_context_109, Turn: 6, Step: 0
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
ID: long_context_81, Turn: 0, Step: 10
ID: base_155, Turn: 0, Step: 17
----------------------------------------------------------------------------------------------------
ID: base_163, Turn: 0, Step: 13
I use the following commands:
bfcl generate --model Mymodel --test-category multi_turn --backend vllm --num-gpus 1 --gpu-memory-utilization 0.9 --include-input-log
and
python -m vllm.entrypoints.openai.api_server --model Mymodel --dtype bfloat16 --port 1053
and the commit id is: 9108a651ec3
Any help would be highly appreciated. Thanks.
Hey @rasoolfa, thanks for the issue. How big is your model, and what kind of GPU are you running it on?
Thanks @HuanzhiMao for the quick reply. The model is 12B, and I am using one H100 for generation; the same machine hosts the vLLM server. As another reference point, this also happens with gemma-3-12B.
My hypothesis is that generation is slow (because of the model size relative to the GPU) but not stuck forever. Could you try one thing: in here, change the worker number to a lower value, say 1/5/10, and see whether generation is happening. You can check either the terminal output or the entries written to the result files (see the sketch below). Also check GPU usage: if it is almost always at 100% utilization, then it is just slow.
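If it helps, here is a rough sketch of how you could watch the result file to see whether entries are still being written. The path is only an illustration (point it at the actual result file for your model and test category), and it assumes results are written one JSON entry per line:

```python
# Rough sketch: poll a BFCL result file and report how many entries have been
# written so far. The path is illustrative; adjust it to your model/result dir.
import time
from pathlib import Path

RESULT_FILE = Path("result/Mymodel/BFCL_v3_multi_turn_long_context_result.json")  # adjust

last_count = -1
while True:
    # Assumes one completed entry per line in the result file.
    count = sum(1 for _ in RESULT_FILE.open()) if RESULT_FILE.exists() else 0
    if count != last_count:
        print(f"{time.strftime('%H:%M:%S')}  {count} entries written")
        last_count = count
    time.sleep(30)
```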
Thanks, I will try that. I think GPU utilization is around 100% and memory usage is ~70 GB.
Not sure if this helps with debugging, but whenever I run bfcl generate for the other tests, I often need to run it twice because I get the following error the first time:
Exception: Subprocess terminated unexpectedly with code -9
However, the tests (except multi-turn) run successfully on the second attempt.
Interesting. I have never encountered that. Could you provide the full trace for the error you got the first time?
By "first run" I mean: if I run bfcl generate for a model I have never run before, I get the error, but it works the second time. I checked my logs; the first run (on a new model) hits an OOM error, while the second run succeeds, even though nvidia-smi shows 0 memory usage before the first run. I'm not sure what's being cached or initialized during the first attempt, but something is likely causing a spike that leads to the OOM.
PS: the trace is long, otherwise I would have copied it here, but the above is the TL;DR.
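For what it's worth, this is roughly how I plan to watch GPU memory during the first run to catch the spike (a quick sketch that just samples nvidia-smi once a second):

```python
# Quick sketch: sample GPU memory usage while the first `bfcl generate` run
# starts, to catch the transient spike before the OOM.
import subprocess
import time

while True:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    used_mib = [int(x) for x in out.split()]  # one value per GPU, in MiB
    print(time.strftime("%H:%M:%S"), used_mib)
    time.sleep(1)
```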
That sounds like an issue with vLLM; updating vLLM to a more recent version might help. The bfcl pipeline itself doesn't deal with GPU usage; it just spins up a vLLM server (here) and queries that endpoint.
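You can also sanity-check the server independently of bfcl by querying the OpenAI-compatible endpoint directly; a minimal sketch, using the port and model name from the command you posted:

```python
# Minimal sketch: query the vLLM OpenAI-compatible server directly to confirm
# it responds, independent of the bfcl pipeline. Port and model name are taken
# from the command above; adjust as needed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1053/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Mymodel",
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=32,
)
print(response.choices[0].message.content)
```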
Setting num_workers to 10 seems to help. Previously, with num_workers=100, only ~1 out of 800 samples would succeed. After lowering it to 10, I got up to 272/800. After that point, however, the process gets stuck again: GPU utilization drops to 0% and progress becomes very slow. I will try num_workers=1 to see if it helps.
With num_workers=1, I got much further than before, but it still hangs and GPU utilization drops to 0%. Is there anything else I can look into that could help debug multi-turn generation and evaluation?
Do you know which entry it hangs at?
Since num_workers=1 is essentially sequential generation, you can look at the result files to find all the entries that have been generated. The next ID is the one it hangs on; the sketch below automates the comparison.
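A rough sketch (both paths and file names are illustrative and depend on your result dir and dataset layout; it assumes each file holds one JSON object with an "id" field per line):

```python
# Rough sketch: compare the IDs already written to the result file against the
# IDs in the dataset file, to see which entry is next (i.e. likely hanging).
import json
from pathlib import Path

DATA_FILE = Path("bfcl_eval/data/BFCL_v3_multi_turn_long_context.json")            # adjust
RESULT_FILE = Path("result/Mymodel/BFCL_v3_multi_turn_long_context_result.json")   # adjust

def load_ids(path):
    return [json.loads(line)["id"] for line in path.open() if line.strip()]

done = set(load_ids(RESULT_FILE))
remaining = [entry_id for entry_id in load_ids(DATA_FILE) if entry_id not in done]
print(f"{len(done)} done; next (likely hanging) entry: {remaining[0] if remaining else 'none'}")
```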
Thanks again for your help. It hangs here:
ID: long_context_105, Turn: 0, Step: 3
----------------------------------------------------------------------------------------------------
ID: long_context_104, Turn: 1, Step: 5
----------------------------------------------------------------------------------------------------
ID: long_context_99, Turn: 3, Step: 5
----------------------------------------------------------------------------------------------------
ID: long_context_98, Turn: 1, Step: 17
----------------------------------------------------------------------------------------------------
ID: long_context_97, Turn: 0, Step: 7
----------------------------------------------------------------------------------------------------
ID: long_context_104, Turn: 1, Step: 6
----------------------------------------------------------------------------------------------------
ID: long_context_105, Turn: 0, Step: 4
----------------------------------------------------------------------------------------------------
ID: long_context_98, Turn: 1, Step: 18
----------------------------------------------------------------------------------------------------
ID: long_context_104, Turn: 1, Step: 7
----------------------------------------------------------------------------------------------------
ID: long_context_97, Turn: 0, Step: 8
----------------------------------------------------------------------------------------------------
ID: long_context_105, Turn: 0, Step: 5
----------------------------------------------------------------------------------------------------
ID: long_context_98, Turn: 1, Step: 19
----------------------------------------------------------------------------------------------------
ID: long_context_104, Turn: 1, Step: 8
----------------------------------------------------------------------------------------------------
ID: long_context_97, Turn: 0, Step: 9
Failed to decode the model response. Proceed to next turn.
----------------------------------------------------------------------------------------------------
ID: long_context_106, Turn: 0, Step: 0
----------------------------------------------------------------------------------------------------
ID: long_context_105, Turn: 0, Step: 6
----------------------------------------------------------------------------------------------------
ID: long_context_104, Turn: 1, Step: 9
----------------------------------------------------------------------------------------------------
ID: long_context_98, Turn: 1, Step: 20
Failed to decode the model response. Proceed to next turn.
----------------------------------------------------------------------------------------------------
ID: long_context_106, Turn: 1, Step: 0
----------------------------------------------------------------------------------------------------
ID: long_context_97, Turn: 0, Step: 10
503 out of 800 still remain; it only finished ~297 cases.
This doesn't look right if num_workers is set to 1.
Sorry, you are right. I copied the output from the 5-worker run. Here is the output with 1 worker:
Failed to decode the model response. Proceed to next turn.
----------------------------------------------------------------------------------------------------
ID: long_context_105, Turn: 0, Step: 0
Generating results for My model 2%|█▋ | 8/503 [02:24<4:08:21, 30.10s/it]----------------------------------------------------------------------------------------------------
ID: long_context_105, Turn: 0, Step: 1
----------------------------------------------------------------------------------------------------
ID: long_context_105, Turn: 0, Step: 2
----------------------------------------------------------------------------------------------------
ID: long_context_105, Turn: 0, Step: 3
----------------------------------------------------------------------------------------------------
ID: long_context_105, Turn: 0, Step: 4
----------------------------------------------------------------------------------------------------
ID: long_context_105, Turn: 0, Step: 5
----------------------------------------------------------------------------------------------------
ID: long_context_105, Turn: 0, Step: 6
Seems like long_context_105 is causing the issue. Still with 1 worker, run only that entry (you can do so with --run-ids) and add a few print statements throughout this function to figure out which part is hanging: is it stuck waiting for the server response, executing the tool call from the model response, etc.? It might also help to print out the model response at each turn.
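The instrumentation can be as simple as timing each stage; a generic sketch (the call names in the usage comments are hypothetical, wrap whatever the actual handler code does):

```python
# Generic timing helper to drop into the multi-turn inference loop. Wrap each
# stage (querying the server, decoding the response, executing the tool call)
# so the logs show where it stalls.
import time
from contextlib import contextmanager

@contextmanager
def timed(stage):
    start = time.time()
    print(f"[debug] {stage} started", flush=True)
    try:
        yield
    finally:
        print(f"[debug] {stage} finished in {time.time() - start:.1f}s", flush=True)

# Hypothetical usage inside the loop (the actual call names will differ):
# with timed("query model endpoint"):
#     api_response = query_model(inference_data)
# with timed("decode + execute tool calls"):
#     execution_results = execute_tool_calls(decoded_response)
```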
Thanks, this is very helpful. I will follow the above suggestions and update you here.
Hey @rasoolfa, following up on this thread: were you able to find the root cause of this hanging issue?
Sorry, I forgot to comment here. It turned out I needed more than one GPU for gemma-3-12b-it. I was originally using a single H100, but for some reason that setup didn't work for gemma-3-12b-it. After switching to 2 (or even 8) GPUs, I was able to get multi-turn results, although it was very slow. Note that with a single GPU, all tests work except multi_turn. Also, lowering the worker number didn't resolve the issue.
bfcl generate --model google/gemma-3-12b-it --test-category simple,multiple,parallel,python,non_python,multi_turn --backend vllm --num-gpus 2 --gpu-memory-utilization 0.9 --include-input-log
python -m vllm.entrypoints.openai.api_server --model google/gemma-3-12b-it --dtype bfloat16 --port 1053 --tensor-parallel-size 2
Hey @HuanzhiMao
I'm using GPT-4.1 and facing a similar issue. What I have found so far is that it fails to decode the response when the response is not a function call. Here is a screenshot where I have also included the model response. This issue only appears in multi_turn samples.
Command
!python -m bfcl_eval generate --model gpt-4.1-2025-04-14 --test-category multi_turn_base --result-dir v3_results --temperature 1
I'm having the same issue
I found a workaround that works for me: when running multi-turn, split the task into four parts. That way it stays stable and no problems occur. I set the number of threads to 200 for each part:
--test-category multi_turn_base --test-category multi_turn_miss_func --test-category multi_turn_miss_param --test-category multi_turn_long_context
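A rough sketch of how I run the four parts one after another (the model name and GPU settings are just placeholders, and I set the thread count of 200 separately since the flag for it may differ between versions):

```python
# Rough sketch: run the four multi-turn categories one at a time instead of all
# at once. Model name and GPU settings are placeholders; adjust to your setup.
import subprocess

CATEGORIES = [
    "multi_turn_base",
    "multi_turn_miss_func",
    "multi_turn_miss_param",
    "multi_turn_long_context",
]

for category in CATEGORIES:
    subprocess.run(
        ["bfcl", "generate",
         "--model", "Mymodel",
         "--test-category", category,
         "--backend", "vllm",
         "--num-gpus", "2",
         "--gpu-memory-utilization", "0.9"],
        check=True,
    )
```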