
OnnxStream and TinyLlama?

Open stl3 opened this issue 1 year ago • 1 comments

I was just wondering: would the methods used in OnnxStream also benefit a tiny language model like TinyLlama? I'd like to know how far resource usage could be brought down (I know it's not an SD model, but I'm wondering whether the same approach could be applied to other types of models). TinyLlama uses about 550 MB of RAM with the 4-bit-quantized TinyLlama-1.1B weights, which seems quite enticing for lower-end devices.
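As a rough check on that figure, the weight footprint can be estimated from the parameter count alone. This is a back-of-the-envelope sketch: `weight_bytes` is a hypothetical helper, the 1.1 billion parameter count is taken from the model name, and the estimate ignores quantization scales, activations, and the KV cache.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Raw storage for the weights alone (assumption: no scales, activations, or KV cache)."""
    return num_params * bits_per_weight // 8


PARAMS = 1_100_000_000  # assumed from the "1.1B" in TinyLlama-1.1B

for bits in (4, 8, 16):
    # Decimal megabytes (1 MB = 1e6 bytes)
    print(f"{bits}-bit: {weight_bytes(PARAMS, bits) / 1e6:.0f} MB")
```

Under these assumptions, 4-bit weights come to about 550 MB and 8-bit weights to about 1.1 GB.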

stl3 avatar Sep 09 '23 09:09 stl3

hi,

It would be interesting to try running TinyLlama with OnnxStream, but the problem would be latency. Each time a token is generated, all the weights would be read from disk again (1.1 GB of data with 8-bit quantization). This could be avoided by implementing a simple WeightsProvider that caches all the weights in RAM, but then total memory consumption would be on par with other frameworks and libraries, which would defeat the purpose of using OnnxStream. Still, it could be an interesting experiment :-)
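The trade-off can be illustrated with a toy model in Python. This is only a sketch under stated assumptions: `StreamingProvider`, `CachingProvider`, and `generate` are hypothetical names invented here and are not OnnxStream's actual C++ WeightsProvider interface, and the "decode loop" simply touches every tensor once per token.

```python
import os
import tempfile


class StreamingProvider:
    """Hypothetical provider: reads a tensor from disk on every request, keeps nothing in RAM."""

    def __init__(self, directory):
        self.directory = directory
        self.bytes_read = 0  # total bytes pulled from disk so far

    def get(self, name):
        with open(os.path.join(self.directory, name), "rb") as f:
            data = f.read()
        self.bytes_read += len(data)
        return data


class CachingProvider(StreamingProvider):
    """Hypothetical provider: reads each tensor once, then serves it from RAM."""

    def __init__(self, directory):
        super().__init__(directory)
        self.cache = {}

    def get(self, name):
        if name not in self.cache:
            self.cache[name] = super().get(name)
        return self.cache[name]

    def resident_bytes(self):
        # Memory held by the cache once all tensors have been loaded
        return sum(len(v) for v in self.cache.values())


def generate(provider, tensor_names, num_tokens):
    """Stand-in for a decode loop: every token touches every weight tensor once."""
    for _ in range(num_tokens):
        for name in tensor_names:
            provider.get(name)


def demo(num_tokens=10, num_tensors=4, tensor_size=1024):
    with tempfile.TemporaryDirectory() as d:
        names = [f"layer{i}.bin" for i in range(num_tensors)]
        for name in names:
            with open(os.path.join(d, name), "wb") as f:
                f.write(bytes(tensor_size))  # dummy weights
        streaming = StreamingProvider(d)
        caching = CachingProvider(d)
        generate(streaming, names, num_tokens)
        generate(caching, names, num_tokens)
        return streaming.bytes_read, caching.bytes_read, caching.resident_bytes()


streaming_read, caching_read, caching_resident = demo()
print("streaming: bytes read from disk =", streaming_read)
print("caching:   bytes read from disk =", caching_read)
print("caching:   bytes resident in RAM =", caching_resident)
```

In this toy setup the streaming provider's disk traffic grows with the number of tokens, while the caching provider reads each tensor once but ends up holding the full set of weights in memory.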

Thanks, Vito


vitoplantamura avatar Sep 12 '23 07:09 vitoplantamura