
[BFCL] Adds support for parallel inference and batching


Parallel Inference Support for berkeley-function-call-leaderboard

This PR adds support for running berkeley-function-call-leaderboard inference in parallel, reducing running time by 4x or more depending on --batch-size.

Changes

Modifies berkeley-function-call-leaderboard/model_handler/handler.py

  • Made the write function async using aiofiles
  • Added a sort_results function that sorts the results by idx after each individual test category finishes
  • sort_results returns the sorted indices, which supports the resume functionality (both changes are sketched below)
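
A minimal sketch of the two handler changes, assuming results are stored one JSON object per line with an idx field; the real functions are methods on the handler class, and their exact signatures may differ:

```python
import json

import aiofiles


async def write(result, file_path):
    # Append one result as a JSON line without blocking the event loop.
    async with aiofiles.open(file_path, mode="a") as f:
        await f.write(json.dumps(result) + "\n")


def sort_results(file_path):
    # After a test category finishes, rewrite the results file sorted by
    # idx and return the indices present, which the resume logic reuses.
    with open(file_path) as f:
        results = [json.loads(line) for line in f if line.strip()]
    results.sort(key=lambda r: r["idx"])
    with open(file_path, "w") as f:
        for r in results:
            f.write(json.dumps(r) + "\n")
    return [r["idx"] for r in results]
```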

Modifies berkeley-function-call-leaderboard/openfunctions_evaluation.py

  • Added a --batch-size argument (default 1) that controls the number of parallel requests
  • Refactored the processing and result-writing logic into a fetch_and_process function
  • Added a make_async helper to wrap synchronous functions as async (used for handler.inference); see the sketch after this list
  • Added a nested progress bar for tracking iterations
  • Moved the core processing logic under a main function
  • Implemented proper resume support, replacing num_existing_lines
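
A rough sketch of how these pieces fit together. The flag --batch-size and the helpers fetch_and_process and make_async come from this PR, but the signatures, the write_result callback, and the semaphore-based batching shown here are illustrative rather than the exact implementation:

```python
import argparse
import asyncio
import functools


def make_async(sync_fn):
    # Wrap a blocking function so it can be awaited; the call runs on the
    # event loop's default thread pool via run_in_executor.
    @functools.wraps(sync_fn)
    async def wrapper(*args, **kwargs):
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(
            None, functools.partial(sync_fn, *args, **kwargs)
        )
    return wrapper


async def fetch_and_process(handler, test_case, semaphore, write_result):
    # Run one inference under the --batch-size concurrency limit and write
    # the result as soon as it is available (results are re-sorted by idx
    # once the category finishes).
    async with semaphore:
        result = await make_async(handler.inference)(test_case)
        await write_result({"idx": test_case["idx"], "result": result})


async def run_category(handler, test_cases, batch_size, write_result):
    semaphore = asyncio.Semaphore(batch_size)
    await asyncio.gather(
        *(fetch_and_process(handler, tc, semaphore, write_result)
          for tc in test_cases)
    )


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch-size", type=int, default=1,
                        help="number of parallel requests")
    return parser.parse_args()
```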

Resume Support

Improved resuming functionality in async code:

  • Addresses potential issues where some test cases complete earlier than others, leading to an inconsistent resume state
  • Filters out already-saved test cases instead of relying on a simple line count
  • Replaces saved test cases with a None placeholder, which is the condition for skipping them during processing (see the sketch after this list)
  • This approach ensures consistent resuming even if execution is interrupted mid-test
  • This matters especially for models that are expensive to run, where re-running the whole test suite is undesirable
  • A test log screenshot is attached at the bottom of this PR to confirm that the resume works as intended
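
A minimal sketch of the resume filtering described above, again assuming one JSON object per line with an idx field; the helper names here are hypothetical:

```python
import json
import os


def load_completed_indices(result_file):
    # Collect the indices that already have a saved result instead of
    # counting lines, so out-of-order completions are handled correctly.
    if not os.path.exists(result_file):
        return set()
    with open(result_file) as f:
        return {json.loads(line)["idx"] for line in f if line.strip()}


def mark_completed(test_cases, completed):
    # Replace already-finished test cases with None placeholders; the
    # processing loop checks for None and skips those entries, so only
    # unfinished cases are sent to the model on resume.
    return [None if tc["idx"] in completed else tc for tc in test_cases]
```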

Note: This PR wraps inference calls as async to minimize code changes, but the underlying calls are still synchronous and would block the event loop. loop.run_in_executor is therefore used to run the calls in parallel on multiple threads; the default executor uses min(32, os.cpu_count() + 4) threads. If handlers are made async in the future, they will continue to work like normal async code.
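
For context, that thread count is simply Python's default for the implicit executor used by run_in_executor(None, ...); a differently sized pool could be passed explicitly if ever needed (not part of this PR):

```python
import os
from concurrent.futures import ThreadPoolExecutor

# Mirrors the stdlib default worker count for the implicit executor.
executor = ThreadPoolExecutor(max_workers=min(32, (os.cpu_count() or 1) + 4))
# loop.run_in_executor(executor, blocking_call) would then use this pool.
```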

Testing

Tested with a custom OpenAI-compatible model served by vLLM:

  • Completed the simple test in 40 seconds
  • Hardware: RTX 4090
  • Model: Llama 8B (BF16)
  • Batch size: 15-20

Benchmark Results

(screenshot: benchmark results)

Debug Logs for new Resume System

(screenshot: debug logs for the new resume system)

TikZSZ · Jul 02 '24 21:07