
fastchat.serve.model_worker --device cpu only uses one CPU thread for token generation.

j0schi opened this issue on Jun 07, 2023 · 0 comments

Hi,

I launch a worker with python3 -m fastchat.serve.model_worker --model-path /home/llamaweights/vicuna-13b --device cpu and then the web GUI, which works fine so far. When I make a request, after an initial loading time one core goes to 100% while the others idle. If I make a second request in another tab, another core goes to 100% while the other 14 idle. Token generation is very slow, but it does not get any slower with additional requests. Can I somehow use all 16 threads, or at least all 8 cores, for a single request to speed up token generation?
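
For context, these are the standard PyTorch thread-count knobs I would expect to matter here. This is only a sketch of the generic PyTorch API; I have not verified how fastchat.serve.model_worker actually wires these up:

```python
# Sketch of the generic PyTorch CPU threading knobs (assumption: model_worker
# runs the model through PyTorch and inherits these settings).
import os

# Must be set before torch is imported to affect OpenMP-backed ops.
os.environ.setdefault("OMP_NUM_THREADS", "8")

import torch

torch.set_num_threads(8)          # intra-op threads: parallelism inside one request
torch.set_num_interop_threads(2)  # inter-op threads: independent ops run in parallel

print(torch.get_num_threads(), torch.get_num_interop_threads())
```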

Kind regards

j0schi, Jun 07 '23 14:06