
[BFCL] Support Parallel Inference for Hosted Models

HuanzhiMao opened this pull request on Aug 7, 2024

This PR introduces multi-threading to parallelize the API calls to hosted model endpoints, significantly speeding up the model response generation process.

Users can specify the number of threads to use for parallel inference via the --num-threads flag. The default is 1, which means no parallel inference.
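For reference, a minimal sketch of the threading pattern this PR describes, using Python's `concurrent.futures`. The names `generate_responses` and `query_model` are illustrative placeholders, not the actual BFCL functions; the sketch only shows how a `--num-threads` value could bound the number of concurrent endpoint calls.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def generate_responses(test_entries, query_model, num_threads=1):
    """Dispatch one API call per test entry across up to `num_threads`
    worker threads. With num_threads=1 this reduces to sequential inference."""
    results = [None] * len(test_entries)
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        # Submit every entry up front; the pool runs at most num_threads
        # requests against the hosted endpoint at any one time.
        future_to_index = {
            executor.submit(query_model, entry): i
            for i, entry in enumerate(test_entries)
        }
        # Collect results as they complete, preserving the original order.
        for future in as_completed(future_to_index):
            results[future_to_index[future]] = future.result()
    return results
```

Because the workload is I/O-bound (waiting on remote API responses), threads give a near-linear speedup here despite the GIL.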
