
Speed issue: generation is slow, roughly one token per second

Open LsEmpire opened this issue 1 year ago • 4 comments

Dear Developers and Friends,

I find it is very slow when running on CPU; my environment has 1 TB of memory and 100 CPU cores.

Looking at the generate_stream function, it seems that every loop iteration generates only one token, appends that token to the content, and then uses the new content to generate the next token.

If it works like this, then on my environment it is very slow to generate a full result.

Is there a way to generate the full result at once, instead of generating one token, appending it, and generating again?
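
Roughly, the pattern I mean looks like this (a simplified sketch written by hand, not the actual FastChat code; generate_one_by_one is just an illustrative name):

```python
import torch

@torch.no_grad()
def generate_one_by_one(model, tokenizer, prompt, max_new_tokens=256):
    # Hypothetical sketch of the pattern: one token per iteration,
    # appended to the content and fed back in for the next iteration.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits                              # forward pass over the content so far
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)    # only ONE new token per loop
        input_ids = torch.cat([input_ids, next_token], dim=-1)        # append it and repeat
        if next_token.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```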

Thanks for your help.

Best Regards, LsEmpire

LsEmpire avatar May 02 '23 09:05 LsEmpire

If you're running inference on a CPU, you should expect slower speed; if you're running on a GPU, generation is much faster.
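
For example, loading the model on a GPU in half precision looks roughly like this (a sketch assuming a Hugging Face checkpoint; the model path is a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/your/vicuna-weights"  # placeholder, point this at your checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
model = model.to("cuda")  # fp16 on GPU roughly halves memory and is much faster than CPU
```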

SupreethRao99 avatar May 02 '23 17:05 SupreethRao99

If you're running inference on a CPU, you should expect slower speed; if you're running on a GPU, generation is much faster.

Thanks for your help and the quick reply. I have another question: it seems that each step predicts just one token, and then uses that token to help predict the next one.

So it seems like every step only predicts one token.

If it works like this, is there some way to predict all the tokens at once?

Also, it seems strange that it only produces one token per second. Are there settings that affect this, or is that the best speed my machine can achieve?

I set stream_interval to -1.

Thanks for your help.

Best Regards, LsEmpire

LsEmpire avatar May 03 '23 03:05 LsEmpire

I'm afraid you can't predict all the tokens at once. The decoding process of most language models is autoregressive, i.e. you need tokens 0..n to predict the (n+1)th token, and you need tokens 0..n+1 to predict the (n+2)th token.
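
To illustrate (a simplified sketch with the Hugging Face API, not the actual generate_stream code): each step depends only on the tokens produced so far, and with the key/value cache only the newest token has to be processed per step, so the cost per token stays roughly constant instead of growing with the content length.

```python
import torch

@torch.no_grad()
def decode(model, input_ids, max_new_tokens=64, eos_token_id=None):
    # Sketch of autoregressive decoding: token n+1 can only be computed
    # once tokens 0..n exist, so the loop is inherently sequential.
    past = None
    ids = input_ids
    generated = []
    for _ in range(max_new_tokens):
        out = model(ids, past_key_values=past, use_cache=True)
        past = out.past_key_values                              # cache of everything seen so far
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated.append(next_id)
        if eos_token_id is not None and next_id.item() == eos_token_id:
            break
        ids = next_id                                           # only the new token is fed next step
    return torch.cat(generated, dim=-1)
```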

SupreethRao99 avatar May 03 '23 03:05 SupreethRao99

Thanks for your reply.

I see the code is in the generate_stream function in the file inference.py.

Could you please check again whether it really uses tokens 0..n to predict token n+1?

I see there is a for loop, and inside the loop each round only gets one logits = model.lm_head(out[0]).

Could you please take a look at the code and confirm whether this is the case?

Function generate_stream from file inference.py
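
If I read it right, each round the logits for the last position are turned into one token id, roughly like this (my guess at the shape of the code, not the exact lines from inference.py):

```python
import torch

def pick_next_token(logits, temperature=0.7):
    # Hypothetical sketch of what each round might do with the logits:
    # keep only the last position and turn it into a single token id.
    last_token_logits = logits[0, -1, :]
    if temperature < 1e-4:
        return int(torch.argmax(last_token_logits))             # greedy decoding
    probs = torch.softmax(last_token_logits / temperature, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))         # temperature sampling
```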

Thanks again for your help.

Best Regards, LsEmpire

LsEmpire avatar May 03 '23 05:05 LsEmpire

Hi, as you are already aware, if you use the worker for generation, you get streaming output. If you need the whole output at once, you can look at serve/huggingface_api. But then you have two disadvantages:

  1. You have to implement an API yourself
  2. You have to wait until the entire output is computed
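
For example, something along these lines with plain Hugging Face transformers (a rough sketch; the model path and prompt are placeholders, and serve/huggingface_api wraps similar calls):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/your/model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)

inputs = tokenizer("Tell me about FastChat.", return_tensors="pt")
# generate() still decodes one token at a time internally;
# it just returns the whole sequence at the end instead of streaming it.
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```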

sudarshan-kamath avatar May 05 '23 07:05 sudarshan-kamath

Thanks

Hi, as you are already aware, if you use the worker for generation, you get streaming output. If you need the whole output at once, you can look at serve/huggingface_api. But then you have two disadvantages:

  1. You have to implement an API yourself
  2. You have to wait until the entire output is computed

Thanks for your reply and help.

I tried writing code that generates everything at once; it worked, and the total time was reduced by about 20%. The speedup is not entirely stable, but in most cases it is quicker than the streaming API.

Best Regards, LsEmpire

LsEmpire avatar May 05 '23 08:05 LsEmpire