CPU inference under 8 GB (Q4/Q8)
@vikhyat Hi, I'm trying to run moondream-2 on a Raspberry Pi 5 for our robotics team, but CPU inference currently requires fp32, which pushes RAM usage to 9.1 GB (tested on Kaggle). We max out at 7.5 GB, with roughly 500 MB already in use by the rest of the system, and would prefer to stay at or below 6-7 GB. A Q4/Q8 version, which you mentioned in issue #54, would be great to use even at lower accuracy. Any instructions for running it, or recommended quants?
You can load it in low precision by installing bitsandbytes and passing load_in_4bit=True when instantiating the model; this uses the quantization support built into the transformers library. That said, the accuracy loss is bad enough that I don't really think it's useful. I'm working on getting better quantization going.
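For reference, a minimal sketch of the 4-bit loading suggested above. The model id (`vikhyatk/moondream2`) and the `trust_remote_code` flag are my assumptions about how this repo is published on the Hugging Face Hub, and note that bitsandbytes quantization generally requires a CUDA GPU, so this likely won't help on a Pi 5's CPU as-is:

```python
# Hedged sketch: load moondream in 4-bit via transformers + bitsandbytes.
# Assumptions: model id "vikhyatk/moondream2" and trust_remote_code=True
# are guesses for this repo; bitsandbytes needs a CUDA device, so this
# is for testing quality/memory on a GPU box, not the Pi itself.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    load_in_4bit=True,       # quantize linear-layer weights to 4-bit at load time
    trust_remote_code=True,  # moondream ships custom modeling code
)
```

Newer transformers versions prefer passing a `BitsAndBytesConfig` via the `quantization_config` argument instead of the bare `load_in_4bit=True` flag, but both routes hit the same bitsandbytes backend.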