[BFCL] Support Parallel Inference for Hosted Models
This PR introduces multi-threading to parallelize API calls to hosted model endpoints, significantly speeding up the model response generation process.
Users can specify the number of threads to use for parallel inference with the `--num-threads` flag. The default is 1, which means no parallel inference.
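
For illustration, here is a minimal sketch of how dispatching API calls across a thread pool can work; it is not the actual BFCL implementation, and `run_inference` and `test_cases` are placeholder names used only for this example.

```python
# Sketch only: parallel inference over hosted-model API calls.
# `run_inference` and `test_cases` are hypothetical placeholders.
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_inference(test_case):
    # Placeholder: send one request to the hosted model endpoint
    # and return the model response for this test case.
    ...


def generate_results(test_cases, num_threads=1):
    results = []
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        # Submit one API call per test case; because the work is
        # network-bound, overlapping the requests cuts wall-clock
        # time roughly in proportion to the number of threads.
        futures = {executor.submit(run_inference, tc): tc for tc in test_cases}
        for future in as_completed(futures):
            results.append(future.result())
    return results
```

With `--num-threads 8`, for example, up to eight requests would be in flight at once instead of being issued one at a time.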