
Is there a way to optimize the output tokens per second?

Open · vinvcn opened this issue 2 years ago · 1 comment

Hi there,

I understand that autoregressive decoding outputs tokens one by one. In a quick manual benchmark, our deployment generates 50 English words in 6 seconds. Is there a way to speed this up? We plan to test on a better machine with 8× the GPUs, but before that we want to do some preliminary research.
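For a rough sense of scale, the 50-words-in-6-seconds figure can be converted to tokens per second. The tokens-per-word ratio below is an assumption (English text with a LLaMA-style BPE tokenizer often runs around 1.3 tokens per word), not a measurement:

```python
# Rough conversion of the manual benchmark (50 words in 6 s) to tokens/s.
# The 1.3 tokens-per-word ratio is an assumed average for English text;
# the real ratio depends on the tokenizer and the content.
words, seconds = 50, 6
tokens_per_word = 1.3  # assumption, not measured

words_per_sec = words / seconds
tokens_per_sec = words_per_sec * tokens_per_word
print(f"{words_per_sec:.1f} words/s ~= {tokens_per_sec:.1f} tokens/s")
```

This puts the deployment at roughly 10-11 tokens/s under that assumption, which is the number worth comparing against other setups.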

We are using: vicuna13b

Regards,

vinvcn avatar May 08 '23 09:05 vinvcn

We use an A10 for inference; it also looks slow.

maplessssy avatar May 08 '23 14:05 maplessssy

> We use an A10 for inference; it also looks slow.

Could you share a rough measurement, e.g. tokens or words per second?

vinvcn avatar May 09 '23 01:05 vinvcn

> We use an A10 for inference; it also looks slow.

> Could you share a rough measurement, e.g. tokens or words per second?

@vinvcn

len(output) = 6756 (bytes or characters?), duration = 109 seconds, so len/duration ≈ 61/s
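Dividing a byte or character count by wall time mixes units; counting generated token ids gives a cleaner throughput number. A minimal sketch of the measurement, where `generate_fn` is a hypothetical wrapper around the real model call (a stand-in generator is used here so the snippet runs without a GPU):

```python
import time

def tokens_per_second(generate_fn, prompt):
    """Time one generation call and report token throughput.

    generate_fn is assumed to return the list of generated token ids;
    with a real model it would wrap model.generate() plus the tokenizer.
    """
    start = time.perf_counter()
    output_ids = generate_fn(prompt)
    elapsed = time.perf_counter() - start
    return len(output_ids) / elapsed

# Stand-in generator so the sketch runs anywhere: pretend we emitted
# 100 tokens in about 0.1 s.
def fake_generate(prompt):
    time.sleep(0.1)
    return list(range(100))

rate = tokens_per_second(fake_generate, "Hello")
print(f"{rate:.0f} tokens/s")
```

With a real model, pass only the newly generated ids (excluding the prompt) so the prompt length does not inflate the rate.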

maplessssy avatar May 09 '23 02:05 maplessssy

4×A100, and at this speed the GPUs do not seem fully utilized. Is there a way to improve inference speed? Thanks. (screenshots omitted)

The command used to start Vicuna is: `python3 -m fastchat.serve.model_worker --model-path /root/vicuna/vicuna13b --num-gpus 4`

@vinvcn

jisheir avatar May 09 '23 06:05 jisheir

You can use other fast inference libraries such as FasterTransformer. We will also soon release a high-throughput batching backend.
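At a high level, the idea behind throughput batching is to decode many requests in one forward pass, so the fixed per-step cost (mostly reading the model weights from GPU memory) is amortized across sequences. A toy cost model of that effect; all numbers below are illustrative assumptions, not measurements of FastChat or any backend:

```python
# Toy cost model for why batched decoding raises throughput: each decode
# step pays a fixed cost (memory-bound weight reads) plus a small
# per-sequence cost. Both constants are assumptions for illustration.
FIXED_MS = 30.0    # assumed fixed cost per decode step
PER_SEQ_MS = 0.5   # assumed extra cost per sequence in the batch

def tokens_per_sec(batch_size):
    # One step produces batch_size tokens (one per sequence).
    step_ms = FIXED_MS + PER_SEQ_MS * batch_size
    return batch_size * 1000.0 / step_ms

for bs in (1, 8, 32):
    print(f"batch={bs:3d}: {tokens_per_sec(bs):7.1f} tokens/s")
```

Under these assumed costs, aggregate throughput grows nearly linearly with batch size until the per-sequence term dominates; per-request latency rises only slightly, which is the trade-off a batching backend exploits.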

merrymercy avatar May 20 '23 14:05 merrymercy

> You can use other fast inference libraries such as FasterTransformer. We will also soon release a high-throughput batching backend.

Thanks for your work on improving this. Could you give some hints about the idea behind throughput batching, for our analysis?

vinvcn avatar May 27 '23 13:05 vinvcn