
Is there a way to optimize the output tokens per second?

Open · vinvcn opened this issue 2 years ago · 1 comment

Hi there,

I understand that autoregressive decoding outputs tokens one by one. In a quick manual benchmark, our deployment generates 50 English words in 6 seconds. Is there a way to speed this up? We plan to test on a better machine with 8× the GPUs, but before that we want to do some preliminary research.
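For a rough sense of scale, the 50-words-in-6-seconds figure can be converted to tokens per second. The tokens-per-word ratio below is an assumption (English text with a LLaMA-style BPE tokenizer often runs around 1.3 tokens per word), not a measurement:

```python
# Rough conversion of the manual benchmark (50 words in 6 s) to tokens/s.
# The 1.3 tokens-per-word ratio is an assumed average for English text;
# the real ratio depends on the tokenizer and the content.
words, seconds = 50, 6
tokens_per_word = 1.3  # assumption, not measured

words_per_sec = words / seconds
tokens_per_sec = words_per_sec * tokens_per_word
print(f"{words_per_sec:.1f} words/s ~= {tokens_per_sec:.1f} tokens/s")
```

This puts the deployment at roughly 10-11 tokens/s under that assumption, which is the number worth comparing against other setups.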

We are using: vicuna13b

Regards,

vinvcn avatar May 08 '23 09:05 vinvcn

We use an A10 for inference; it also looks slow.

maplessssy avatar May 08 '23 14:05 maplessssy

> We use an A10 for inference; it also looks slow.

Could you share a rough measurement, e.g. tokens or words per second?

vinvcn avatar May 09 '23 01:05 vinvcn

> We use an A10 for inference; it also looks slow.

> Could you share a rough measurement, e.g. tokens or words per second?

@vinvcn

len(output) = 6756 (bytes or characters?), duration = 109 seconds, so len/duration ≈ 61/s
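Dividing a byte or character count by wall time mixes units; counting generated token ids gives a cleaner throughput number. A minimal sketch of the measurement, where `generate_fn` is a hypothetical wrapper around the real model call (a stand-in generator is used here so the snippet runs without a GPU):

```python
import time

def tokens_per_second(generate_fn, prompt):
    """Time one generation call and report token throughput.

    generate_fn is assumed to return the list of generated token ids;
    with a real model it would wrap model.generate() plus the tokenizer.
    """
    start = time.perf_counter()
    output_ids = generate_fn(prompt)
    elapsed = time.perf_counter() - start
    return len(output_ids) / elapsed

# Stand-in generator so the sketch runs anywhere: pretend we emitted
# 100 tokens in about 0.1 s.
def fake_generate(prompt):
    time.sleep(0.1)
    return list(range(100))

rate = tokens_per_second(fake_generate, "Hello")
print(f"{rate:.0f} tokens/s")
```

With a real model, pass only the newly generated ids (excluding the prompt) so the prompt length does not inflate the rate.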

maplessssy avatar May 09 '23 02:05 maplessssy

4×A100, and at this speed the GPUs do not seem fully utilized. Is there a way to improve inference speed? Thanks. (screenshots omitted)

The command used to start Vicuna is: `python3 -m fastchat.serve.model_worker --model-path /root/vicuna/vicuna13b --num-gpus 4`

@vinvcn

jisheir avatar May 09 '23 06:05 jisheir

You can use other fast inference libraries such as FasterTransformer. We will also soon release a high-throughput batching backend.
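At a high level, the idea behind throughput batching is to decode many requests in one forward pass, so the fixed per-step cost (mostly reading the model weights from GPU memory) is amortized across sequences. A toy cost model of that effect; all numbers below are illustrative assumptions, not measurements of FastChat or any backend:

```python
# Toy cost model for why batched decoding raises throughput: each decode
# step pays a fixed cost (memory-bound weight reads) plus a small
# per-sequence cost. Both constants are assumptions for illustration.
FIXED_MS = 30.0    # assumed fixed cost per decode step
PER_SEQ_MS = 0.5   # assumed extra cost per sequence in the batch

def tokens_per_sec(batch_size):
    # One step produces batch_size tokens (one per sequence).
    step_ms = FIXED_MS + PER_SEQ_MS * batch_size
    return batch_size * 1000.0 / step_ms

for bs in (1, 8, 32):
    print(f"batch={bs:3d}: {tokens_per_sec(bs):7.1f} tokens/s")
```

Under these assumed costs, aggregate throughput grows nearly linearly with batch size until the per-sequence term dominates; per-request latency rises only slightly, which is the trade-off a batching backend exploits.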

merrymercy avatar May 20 '23 14:05 merrymercy

> You can use other fast inference libraries such as FasterTransformer. We will also soon release a high-throughput batching backend.

Thanks for your work on improving this. Could you give some hints about the idea behind throughput batching, for our analysis?

vinvcn avatar May 27 '23 13:05 vinvcn