FastChat
FastChat with API using only one processor core on CPU for output generation
I am running this on CPU. During prompt processing, all available CPU cores are used (and, as expected, a larger input takes longer to process). But once the input has been read in, only one processor core actually does the generation when using the local OpenAI drop-in API, whereas the CLI uses all available cores the whole time. Is there a parameter that controls this, or is this a bug?
I think the CLI and the API (model worker) use the same underlying code. If your observation is correct, there might be a bug (e.g., in environment variable settings). Could you help us fix it?
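If environment variables are the culprit, one thing worth comparing between the two processes is the thread-count variables that CPU math libraries read at startup (e.g. `OMP_NUM_THREADS`, `MKL_NUM_THREADS`). A minimal stdlib sketch for checking what a process would see, assuming the backend honors `OMP_NUM_THREADS` (the helper `effective_threads` is hypothetical, just for illustration):

```python
import os

def effective_threads():
    """Return the thread count an OMP-aware math library would likely use:
    OMP_NUM_THREADS if set, otherwise all available cores."""
    val = os.environ.get("OMP_NUM_THREADS")
    return int(val) if val else os.cpu_count()

# Simulate launching the worker with the variable set vs. unset.
os.environ["OMP_NUM_THREADS"] = "8"
print(effective_threads())  # → 8
```

Running this check inside both the CLI process and the API model-worker process would show whether one of them is being launched with the variable pinned to 1.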
My python is not very good yet, but sure, let me know how I can help.
The CLI is using only 1 CPU core too.