
[BFCL] Adds support for parallel inference and batching


Parallel Inference Support for berkeley-function-call-leaderboard

This PR adds support for running berkeley-function-call-leaderboard inference in parallel, reducing running time by 4x or more depending on --batch-size.

Changes

Modifies berkeley-function-call-leaderboard/model_handler/handler.py

  • Made the write function async using aiofiles
  • Added a sort_results function that sorts the results by idx after each individual test category finishes
  • sort_results returns the sorted indices, which supports the resume functionality (both changes are sketched below)
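
A minimal sketch of the two handler changes, assuming results are stored one JSON object per line with an idx field; the real functions are methods on the handler class, and their exact signatures may differ:

```python
import json

import aiofiles


async def write(result, file_path):
    # Append one result as a JSON line without blocking the event loop.
    async with aiofiles.open(file_path, mode="a") as f:
        await f.write(json.dumps(result) + "\n")


def sort_results(file_path):
    # After a test category finishes, rewrite the results file sorted by
    # idx and return the indices present, which the resume logic reuses.
    with open(file_path) as f:
        results = [json.loads(line) for line in f if line.strip()]
    results.sort(key=lambda r: r["idx"])
    with open(file_path, "w") as f:
        for r in results:
            f.write(json.dumps(r) + "\n")
    return [r["idx"] for r in results]
```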

Modifies berkeley-function-call-leaderboard/openfunctions_evaluation.py

  • Added a --batch-size argument (default 1) that controls the number of parallel requests
  • Refactored the processing and result-writing logic into a fetch_and_process function
  • Added a make_async helper to wrap synchronous functions as async (used for handler.inference); see the sketch after this list
  • Added a nested progress bar for tracking iterations
  • Moved the core processing logic under a main function
  • Implemented proper resume support, replacing num_existing_lines
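
A rough sketch of how these pieces fit together. The flag --batch-size and the helpers fetch_and_process and make_async come from this PR, but the signatures, the write_result callback, and the semaphore-based batching shown here are illustrative rather than the exact implementation:

```python
import argparse
import asyncio
import functools


def make_async(sync_fn):
    # Wrap a blocking function so it can be awaited; the call runs on the
    # event loop's default thread pool via run_in_executor.
    @functools.wraps(sync_fn)
    async def wrapper(*args, **kwargs):
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(
            None, functools.partial(sync_fn, *args, **kwargs)
        )
    return wrapper


async def fetch_and_process(handler, test_case, semaphore, write_result):
    # Run one inference under the --batch-size concurrency limit and write
    # the result as soon as it is available (results are re-sorted by idx
    # once the category finishes).
    async with semaphore:
        result = await make_async(handler.inference)(test_case)
        await write_result({"idx": test_case["idx"], "result": result})


async def run_category(handler, test_cases, batch_size, write_result):
    semaphore = asyncio.Semaphore(batch_size)
    await asyncio.gather(
        *(fetch_and_process(handler, tc, semaphore, write_result)
          for tc in test_cases)
    )


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch-size", type=int, default=1,
                        help="number of parallel requests")
    return parser.parse_args()
```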

Resume Support

Improved resuming functionality in async code:

  • Addresses potential issues where some test cases complete earlier than others, leading to an inconsistent resume state
  • Filters out already-saved test cases instead of relying on a simple line count
  • Replaces saved test cases with a None placeholder, which is the condition for skipping them during processing (see the sketch after this list)
  • This approach ensures consistent resuming even if execution is interrupted mid-test
  • This matters especially for models that are expensive to run, where re-running the whole test suite is undesirable
  • A test log screenshot is attached at the bottom of this PR to confirm that the resume works as intended
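
A minimal sketch of the resume filtering described above, again assuming one JSON object per line with an idx field; the helper names here are hypothetical:

```python
import json
import os


def load_completed_indices(result_file):
    # Collect the indices that already have a saved result instead of
    # counting lines, so out-of-order completions are handled correctly.
    if not os.path.exists(result_file):
        return set()
    with open(result_file) as f:
        return {json.loads(line)["idx"] for line in f if line.strip()}


def mark_completed(test_cases, completed):
    # Replace already-finished test cases with None placeholders; the
    # processing loop checks for None and skips those entries, so only
    # unfinished cases are sent to the model on resume.
    return [None if tc["idx"] in completed else tc for tc in test_cases]
```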

Note: This PR wraps inference calls as async to minimize code changes, but the underlying calls are still synchronous and would block the event loop. loop.run_in_executor is therefore used to run the calls in parallel on multiple threads; the default executor uses min(32, os.cpu_count() + 4) threads. If handlers are made async in the future, they will continue to work like normal async code.
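
For context, that thread count is simply Python's default for the implicit executor used by run_in_executor(None, ...); a differently sized pool could be passed explicitly if ever needed (not part of this PR):

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Mirrors the stdlib default worker count for the implicit executor.
executor = ThreadPoolExecutor(max_workers=min(32, (os.cpu_count() or 1) + 4))
# loop.run_in_executor(executor, blocking_call) would then use this pool.
```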

Testing

Tested with a custom OpenAI-compatible model served by vLLM:

  • Completed the simple test in 40 seconds
  • Hardware: RTX 4090
  • Model: Llama 8B (BF16)
  • Batch size: 15-20

Benchmark Results

(screenshot: benchmark results)

Debug Logs for new Resume System

(screenshot: debug logs for the new resume system)

TikZSZ · Jul 02 '24 21:07