[BFCL] Adds support for parallel inference and batching
Parallel Inference Support for berkeley-function-call-leaderboard
This PR adds support for running berkeley-function-call-leaderboard inference in parallel, reducing running time by 4x or more depending on `--batch-size`.
Changes
Modifies `berkeley-function-call-leaderboard/model_handler/handler.py`
- Modified the `write` function to make it async using `aiofiles`
- Added a `sort_results` function that sorts the results by `idx` after each individual test category finishes
- `sort_results` returns the indices after sorting, which supports the resume functionality (see the sketch after this list)
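As a rough illustration only, the two helpers could look something like the sketch below; the real handler methods take `self`, and the exact file layout and result schema here are assumptions:

```python
import json

import aiofiles


async def write(result, file_path):
    # Hypothetical sketch: append one result as a JSON line without blocking the event loop.
    async with aiofiles.open(file_path, mode="a") as f:
        await f.write(json.dumps(result) + "\n")


def sort_results(file_path):
    # Hypothetical sketch: rewrite the result file ordered by the original test index ("idx")
    # and return the sorted indices so a later run can resume against them.
    with open(file_path) as f:
        results = [json.loads(line) for line in f]
    results.sort(key=lambda r: r["idx"])
    with open(file_path, "w") as f:
        for r in results:
            f.write(json.dumps(r) + "\n")
    return [r["idx"] for r in results]
```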
Modifies `berkeley-function-call-leaderboard/openfunctions_evaluation.py`
- Added a `--batch-size` arg (defaults to `1`) that controls the number of parallel requests
- Refactored the processing and result-writing logic into a `fetch_and_process` function (sketched after this list)
- Added a `make_async` function to wrap sync functions as async (used for `handler.inference`)
- Added a nested progress bar for tracking iterations
- Refactored the core processing logic under a `main` function
- Implemented proper resume support, replacing `num_existing_lines`
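To give a sense of how the pieces fit together, here is a minimal sketch of the batching flow. The actual `fetch_and_process` and `main` signatures in the PR differ, and the sketch assumes `handler.inference` has already been async-wrapped:

```python
import asyncio


async def fetch_and_process(handler, test_case, semaphore, write_result):
    # Hypothetical sketch: run one inference call under the concurrency limit,
    # then persist its result.
    async with semaphore:
        result = await handler.inference(test_case)  # assumes an async-wrapped handler
        await write_result(result)


async def main(handler, test_cases, batch_size, write_result):
    # `--batch-size` bounds how many requests are in flight at any one time.
    semaphore = asyncio.Semaphore(batch_size)
    await asyncio.gather(
        *(fetch_and_process(handler, tc, semaphore, write_result) for tc in test_cases)
    )
```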
Resume Support
Improved resuming functionality in async code:
- Addresses potential issues where some test cases finish earlier than others, which could lead to inconsistent resumes
- Filters out already-saved test cases instead of relying on a simple line count
- For saved test cases, a `None` placeholder is inserted, and that placeholder is the condition for skipping them (see the sketch after this list)
- This approach ensures consistent resuming even if execution is interrupted mid-test
- This matters most for models that are expensive to run, where re-running the whole test suite is undesirable
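As an illustration of the placeholder idea only, the sketch below assumes results are stored as JSON lines and that each test case and result carries an `idx` field:

```python
import json


def load_completed_indices(result_file):
    # Hypothetical sketch: collect the idx of every result already on disk.
    try:
        with open(result_file) as f:
            return {json.loads(line)["idx"] for line in f}
    except FileNotFoundError:
        return set()


def mark_completed(test_cases, completed):
    # Replace already-saved test cases with None; the processing loop skips
    # None entries, so interrupted runs resume consistently.
    return [None if tc["idx"] in completed else tc for tc in test_cases]
```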
A test log screenshot is also attached at the bottom of this PR to confirm that it works as intended.
Note: To minimize code changes, this PR wraps the inference calls as async, but the calls themselves are still synchronous and would block the event loop. We therefore use `loop.run_in_executor` to run them in parallel on multiple threads; by default the executor uses `min(32, os.cpu_count() + 4)` threads. If handlers are made async in the future, they will continue to work like normal async code.
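For reference, a thin wrapper along these lines would do the job; the exact `make_async` in the PR may differ:

```python
import asyncio
import functools


def make_async(func):
    # Hypothetical sketch: run a blocking function in asyncio's default thread pool,
    # which is sized as min(32, os.cpu_count() + 4) threads since Python 3.8.
    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, functools.partial(func, *args, **kwargs))
    return wrapper
```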
Testing
Tested on a custom OpenAI-compatible model served with vLLM:
- Completed the simple test in 40 seconds
- Hardware: RTX 4090
- Model: LLAMA 8B BF16
- Batch size: 15-20