OnnxStream and TinyLlama?

Open stl3 opened this issue 1 year ago • 1 comments

I was wondering whether the methods used in OnnxStream would also benefit a tiny language model like TinyLlama. I'd like to know how far resource usage could be brought down (I know it's not an SD model, but I'm wondering whether the same approach could be applied to other types of models). TinyLlama uses about 550MB of RAM with the 4-bit quantized TinyLlama-1.1B weights, which seems quite enticing for lower-end devices.
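For reference, the 550MB figure is roughly what raw weight storage alone comes to for 1.1 billion parameters at 4 bits each. The sketch below is only back-of-the-envelope arithmetic: the function name and the rounded parameter count are assumptions made for illustration, and it ignores quantization scales, activations and the KV cache.

```python
def weight_bytes(num_params: int, bits_per_param: int) -> int:
    """Raw weight storage in bytes (no scales, activations or KV cache)."""
    return num_params * bits_per_param // 8


# Assumption: TinyLlama-1.1B rounded to 1.1 billion parameters.
PARAMS = 1_100_000_000

for bits in (4, 8, 16):
    print(f"{bits}-bit: {weight_bytes(PARAMS, bits) / 1e6:.0f} MB")
```

At 4 bits this gives 550 MB, and at 8 bits 1100 MB.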

Sep 09 '23 09:09 stl3

hi,

It would be interesting to try running TinyLlama with OnnxStream, but the problem would be latency. For each generated token, all the weights would be read from disk again (1.1GB of data, using 8-bit quantization). This could be avoided by implementing a simple WeightsProvider that caches all the weights in RAM, but then total memory consumption would be at the same level as other frameworks/libraries, making the use of OnnxStream pointless. Still, it could be an interesting experiment :-)
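The trade-off can be illustrated with a toy sketch. This is not OnnxStream's actual WeightsProvider interface: the class names, the `get` method and the tiny dummy weight files are all hypothetical, chosen only to show how bytes read from disk and bytes held in RAM move in opposite directions.

```python
import os
import tempfile


class StreamingProvider:
    """Hypothetical provider: reads a weight file from disk on every request."""

    def __init__(self, directory):
        self.directory = directory
        self.bytes_read = 0  # total bytes fetched from disk so far

    def get(self, name):
        with open(os.path.join(self.directory, name), "rb") as f:
            data = f.read()
        self.bytes_read += len(data)
        return data


class CachingProvider(StreamingProvider):
    """Hypothetical provider: keeps every weight in RAM after the first read."""

    def __init__(self, directory):
        super().__init__(directory)
        self.cache = {}

    def get(self, name):
        if name not in self.cache:
            self.cache[name] = super().get(name)
        return self.cache[name]


def generate(provider, layer_names, num_tokens):
    """Stand-in for generation: every token touches every layer's weights."""
    for _ in range(num_tokens):
        for name in layer_names:
            provider.get(name)
    return provider.bytes_read


def compare(num_layers=4, layer_size=1024, num_tokens=10):
    """Return (bytes streamed, bytes read with cache, bytes resident in cache)."""
    with tempfile.TemporaryDirectory() as d:
        names = [f"layer{i}.bin" for i in range(num_layers)]
        for n in names:
            with open(os.path.join(d, n), "wb") as f:
                f.write(bytes(layer_size))  # dummy zero-filled weights
        streamed = generate(StreamingProvider(d), names, num_tokens)
        caching = CachingProvider(d)
        cached = generate(caching, names, num_tokens)
        resident = sum(len(v) for v in caching.cache.values())
    return streamed, cached, resident


print(compare())
```

With the defaults, the streaming provider reads the full set of weights once per token (ten times in total), while the caching provider reads it once but keeps all of it resident in memory.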

Thanks, Vito

Sep 12 '23 07:09 vitoplantamura
