grok-1
Inference memory usage?
I've been experimenting with inference and noticed something about how the model's memory is used. The model appears to be loaded in FP16, yet QuantizedWeight8bit is imported in run.py without ever being used. Is that import meant to dequantize the checkpoint into FP16 weights on the fly, or was it intended for running directly on the 8-bit quantized weights, which we don't seem to be taking advantage of?
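For context, here is a minimal sketch of what I'd expect such a container to look like and how it could be dequantized on the fly. The field names (`weight`, `scales`) and the `dequantize` helper are my assumptions for illustration, not necessarily how the repo actually defines or uses QuantizedWeight8bit:

```python
from dataclasses import dataclass
import jax.numpy as jnp

# Hypothetical container: int8 weights plus per-block/per-channel scales.
@dataclass
class QuantizedWeight8bit:
    weight: jnp.ndarray  # int8 values, e.g. shape (rows, cols)
    scales: jnp.ndarray  # float scales, broadcastable against `weight`

def dequantize(qw: QuantizedWeight8bit, dtype=jnp.bfloat16) -> jnp.ndarray:
    # Recover an approximate full-precision matrix: int8 values * scales.
    return qw.weight.astype(dtype) * qw.scales.astype(dtype)
```

If the import is only there so the checkpoint can be deserialized and then immediately expanded like this, peak memory would still look like FP16; if the forward pass kept the int8 weights and applied the scales per matmul instead, the footprint should be roughly half. Which of the two is the intended path?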