[BFCL] The multi-turn generation hangs and doesn’t progress.
Hi,
I have been trying to run generation for multi-turn use cases, but after many attempts it still doesn't work (I waited about 24 hours). The other tests run fine; the only issue is multi-turn generation.
It usually gets stuck around here, at what looks like roughly 1% progress:
ID: long_context_86, Turn: 0, Step: 9
----------------------------------------------------------------------------------------------------
ID: long_context_52, Turn: 0, Step: 12
Failed to decode the model response. Proceed to next turn.
----------------------------------------------------------------------------------------------------
ID: long_context_139, Turn: 1, Step: 0
----------------------------------------------------------------------------------------------------
ID: long_context_129, Turn: 0, Step: 2
Failed to decode the model response. Proceed to next turn.
----------------------------------------------------------------------------------------------------
ID: long_context_142, Turn: 0, Step: 0
----------------------------------------------------------------------------------------------------
ID: long_context_47, Turn: 1, Step: 5
----------------------------------------------------------------------------------------------------
ID: base_199, Turn: 0, Step: 10
----------------------------------------------------------------------------------------------------
ID: long_context_97, Turn: 0, Step: 6
----------------------------------------------------------------------------------------------------
ID: long_context_115, Turn: 0, Step: 6
----------------------------------------------------------------------------------------------------
ID: long_context_73, Turn: 0, Step: 11
----------------------------------------------------------------------------------------------------
ID: long_context_80, Turn: 0, Step: 9
Failed to decode the model response. Proceed to next turn.
----------------------------------------------------------------------------------------------------
ID: long_context_109, Turn: 6, Step: 0
----------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------
ID: long_context_81, Turn: 0, Step: 10
ID: base_155, Turn: 0, Step: 17
----------------------------------------------------------------------------------------------------
ID: base_163, Turn: 0, Step: 13
I use the following commands:
bfcl generate --model Mymodel --test-category multi_turn --backend vllm --num-gpus 1 --gpu-memory-utilization 0.9 --include-input-log
and
python -m vllm.entrypoints.openai.api_server --model Mymodel --dtype bfloat16 --port 1053
and the commit id is: 9108a651ec3
Any help would be highly appreciated. Thanks.
Hey @rasoolfa, thanks for the issue. How big is your model, and what kind of GPU are you running it on?
Thanks @HuanzhiMao for the quick reply. The model is 12B, and I am using one H100 for generation; the same machine hosts the vLLM server. As another reference point, this also happens with gemma-3-12B.
My hypothesis is that generation is slow (because of the model size relative to the GPU) but not stuck forever. Could you try one thing: in here, change the worker number to a lower value, say 1/5/10, and see whether generation is happening. You can check either the terminal output or the entries written to the result files (see the sketch below). Also check GPU usage: if it is almost always at 100% utilization, then it is just slow.
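If it helps, here is a rough sketch of how you could watch the result file to see whether entries are still being written. The path is only an illustration (point it at the actual result file for your model and test category), and it assumes results are written one JSON entry per line:

```python
# Rough sketch: poll a BFCL result file and report how many entries have been
# written so far. The path is illustrative; adjust it to your model/result dir.
import time
from pathlib import Path

RESULT_FILE = Path("result/Mymodel/BFCL_v3_multi_turn_long_context_result.json")  # adjust

last_count = -1
while True:
    # Assumes one completed entry per line in the result file.
    count = sum(1 for _ in RESULT_FILE.open()) if RESULT_FILE.exists() else 0
    if count != last_count:
        print(f"{time.strftime('%H:%M:%S')}  {count} entries written")
        last_count = count
    time.sleep(30)
```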
Thanks, I will try that. I think GPU utilization is around 100% and memory usage is ~70 GB.
Not sure if this helps with debugging, but whenever I run bfcl generate for the other tests, I often need to run it twice because I get the following error the first time:
Exception: Subprocess terminated unexpectedly with code -9
However, the tests (except multi-turn) run successfully on the second attempt.
Interesting. I have never encountered that. Could you provide the full trace for the error you got the first time?
By "first run" I mean: if I run bfcl generate for a model I have never run before, I get the error, but it works the second time. I checked my logs; the first run (on a new model) hits an OOM error, while the second run succeeds, even though nvidia-smi shows 0 memory usage before the first run. I'm not sure what's being cached or initialized during the first attempt, but something is likely causing a spike that leads to the OOM.
PS: the trace is long, otherwise I would have copied it here, but the above is the TL;DR.
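For what it's worth, this is roughly how I plan to watch GPU memory during the first run to catch the spike (a quick sketch that just samples nvidia-smi once a second):

```python
# Quick sketch: sample GPU memory usage while the first `bfcl generate` run
# starts, to catch the transient spike before the OOM.
import subprocess
import time

while True:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    used_mib = [int(x) for x in out.split()]  # one value per GPU, in MiB
    print(time.strftime("%H:%M:%S"), used_mib)
    time.sleep(1)
```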
That sounds like an issue with vLLM; updating vLLM to a more recent version might help. The bfcl pipeline itself doesn't deal with GPU usage; it just spins up a vLLM server (here) and queries that endpoint.
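You can also sanity-check the server independently of bfcl by querying the OpenAI-compatible endpoint directly; a minimal sketch, using the port and model name from the command you posted:

```python
# Minimal sketch: query the vLLM OpenAI-compatible server directly to confirm
# it responds, independent of the bfcl pipeline. Port and model name are taken
# from the command above; adjust as needed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1053/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Mymodel",
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=32,
)
print(response.choices[0].message.content)
```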
Setting num_workers to 10 seems to help. Previously, with num_workers=100, only ~1 out of 800 samples would succeed. After lowering it to 10, I got up to 272/800. After that point, however, the process gets stuck again: GPU utilization drops to 0% and progress becomes very slow. I will try num_workers=1 to see if it helps.
With num_workers=1, I got much further than before, but it still hangs and GPU utilization drops to 0%. Is there anything else I can look into that could help debug multi-turn generation and evaluation?
Do you know which entry it hangs at?
Since num_workers=1 is essentially sequential generation, you can look at the result files to find all the entries that have been generated. The next ID is the one it hangs on; the sketch below automates the comparison.
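A rough sketch (both paths and file names are illustrative and depend on your result dir and dataset layout; it assumes each file holds one JSON object with an "id" field per line):

```python
# Rough sketch: compare the IDs already written to the result file against the
# IDs in the dataset file, to see which entry is next (i.e. likely hanging).
import json
from pathlib import Path

DATA_FILE = Path("bfcl_eval/data/BFCL_v3_multi_turn_long_context.json")            # adjust
RESULT_FILE = Path("result/Mymodel/BFCL_v3_multi_turn_long_context_result.json")   # adjust

def load_ids(path):
    return [json.loads(line)["id"] for line in path.open() if line.strip()]

done = set(load_ids(RESULT_FILE))
remaining = [entry_id for entry_id in load_ids(DATA_FILE) if entry_id not in done]
print(f"{len(done)} done; next (likely hanging) entry: {remaining[0] if remaining else 'none'}")
```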
Thanks again for your help. It hangs here:
ID: long_context_105, Turn: 0, Step: 3
----------------------------------------------------------------------------------------------------
ID: long_context_104, Turn: 1, Step: 5
----------------------------------------------------------------------------------------------------
ID: long_context_99, Turn: 3, Step: 5
----------------------------------------------------------------------------------------------------
ID: long_context_98, Turn: 1, Step: 17
----------------------------------------------------------------------------------------------------
ID: long_context_97, Turn: 0, Step: 7
----------------------------------------------------------------------------------------------------
ID: long_context_104, Turn: 1, Step: 6
----------------------------------------------------------------------------------------------------
ID: long_context_105, Turn: 0, Step: 4
----------------------------------------------------------------------------------------------------
ID: long_context_98, Turn: 1, Step: 18
----------------------------------------------------------------------------------------------------
ID: long_context_104, Turn: 1, Step: 7
----------------------------------------------------------------------------------------------------
ID: long_context_97, Turn: 0, Step: 8
----------------------------------------------------------------------------------------------------
ID: long_context_105, Turn: 0, Step: 5
----------------------------------------------------------------------------------------------------
ID: long_context_98, Turn: 1, Step: 19
----------------------------------------------------------------------------------------------------
ID: long_context_104, Turn: 1, Step: 8
----------------------------------------------------------------------------------------------------
ID: long_context_97, Turn: 0, Step: 9
Failed to decode the model response. Proceed to next turn.
----------------------------------------------------------------------------------------------------
ID: long_context_106, Turn: 0, Step: 0
----------------------------------------------------------------------------------------------------
ID: long_context_105, Turn: 0, Step: 6
----------------------------------------------------------------------------------------------------
ID: long_context_104, Turn: 1, Step: 9
----------------------------------------------------------------------------------------------------
ID: long_context_98, Turn: 1, Step: 20
Failed to decode the model response. Proceed to next turn.
----------------------------------------------------------------------------------------------------
ID: long_context_106, Turn: 1, Step: 0
----------------------------------------------------------------------------------------------------
ID: long_context_97, Turn: 0, Step: 10
503 out of 800 still remain; it only finished ~297 cases.
This doesn't look right if num_workers is set to 1.
Sorry, you are right. I copied the output from the 5-worker run. Here is the output with 1 worker:
Failed to decode the model response. Proceed to next turn.
----------------------------------------------------------------------------------------------------
ID: long_context_105, Turn: 0, Step: 0
Generating results for My model 2%|█▋ | 8/503 [02:24<4:08:21, 30.10s/it]----------------------------------------------------------------------------------------------------
ID: long_context_105, Turn: 0, Step: 1
----------------------------------------------------------------------------------------------------
ID: long_context_105, Turn: 0, Step: 2
----------------------------------------------------------------------------------------------------
ID: long_context_105, Turn: 0, Step: 3
----------------------------------------------------------------------------------------------------
ID: long_context_105, Turn: 0, Step: 4
----------------------------------------------------------------------------------------------------
ID: long_context_105, Turn: 0, Step: 5
----------------------------------------------------------------------------------------------------
ID: long_context_105, Turn: 0, Step: 6
Seems like long_context_105 is causing the issue. Still with 1 worker, run only that entry (you can do so with --run-ids) and add a few print statements throughout this function to figure out which part is hanging: is it stuck waiting for the server response, executing the tool call from the model response, etc.? It might also help to print out the model response at each turn.
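The instrumentation can be as simple as timing each stage; a generic sketch (the call names in the usage comments are hypothetical, wrap whatever the actual handler code does):

```python
# Generic timing helper to drop into the multi-turn inference loop. Wrap each
# stage (querying the server, decoding the response, executing the tool call)
# so the logs show where it stalls.
import time
from contextlib import contextmanager

@contextmanager
def timed(stage):
    start = time.time()
    print(f"[debug] {stage} started", flush=True)
    try:
        yield
    finally:
        print(f"[debug] {stage} finished in {time.time() - start:.1f}s", flush=True)

# Hypothetical usage inside the loop (the actual call names will differ):
# with timed("query model endpoint"):
#     api_response = query_model(inference_data)
# with timed("decode + execute tool calls"):
#     execution_results = execute_tool_calls(decoded_response)
```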
Thanks, this is very helpful. I will follow the above suggestions and update you here.
Hey @rasoolfa, following up on this thread: were you able to find the root cause of this hanging issue?
Sorry, I forgot to comment here. It turned out I needed more than one GPU for gemma-3-12b-it. I was originally using a single H100, but for some reason that setup didn't work for gemma-3-12b-it. After switching to 2 (or even 8) GPUs, I was able to get multi-turn results, although it was very slow. Note that with a single GPU, all tests work except multi_turn. Also, lowering the worker number didn't resolve the issue.
bfcl generate --model google/gemma-3-12b-it --test-category simple,multiple,parallel,python,non_python,multi_turn --backend vllm --num-gpus 2 --gpu-memory-utilization 0.9 --include-input-log
python -m vllm.entrypoints.openai.api_server --model google/gemma-3-12b-it --dtype bfloat16 --port 1053 --tensor-parallel-size 2
Hey @HuanzhiMao
I'm using GPT-4.1 and facing a similar issue. What I have found so far is that it fails to decode the response when the response is not a function call. Here is a screenshot where I have also included the model response. This issue only appears in multi_turn samples.
Command
!python -m bfcl_eval generate --model gpt-4.1-2025-04-14 --test-category multi_turn_base --result-dir v3_results --temperature 1
I'm having the same issue
I found a workaround that works for me: when running multi-turn, split the task into four parts. That way it stays stable and no problems occur. I set the number of threads to 200 for each part:
--test-category multi_turn_base --test-category multi_turn_miss_func --test-category multi_turn_miss_param --test-category multi_turn_long_context
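A rough sketch of how I run the four parts one after another (the model name and GPU settings are just placeholders, and I set the thread count of 200 separately since the flag for it may differ between versions):

```python
# Rough sketch: run the four multi-turn categories one at a time instead of all
# at once. Model name and GPU settings are placeholders; adjust to your setup.
import subprocess

CATEGORIES = [
    "multi_turn_base",
    "multi_turn_miss_func",
    "multi_turn_miss_param",
    "multi_turn_long_context",
]

for category in CATEGORIES:
    subprocess.run(
        ["bfcl", "generate",
         "--model", "Mymodel",
         "--test-category", category,
         "--backend", "vllm",
         "--num-gpus", "2",
         "--gpu-memory-utilization", "0.9"],
        check=True,
    )
```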