grok-1
Inference memory usage?
I've been experimenting with inference and noticed something about how the model's memory is used. The model appears to be loaded in FP16, yet QuantizedWeight8bit is imported in run.py without ever being used. Is that import meant to dequantize the checkpoint into FP16 weights on the fly, or was it intended for running directly on the 8-bit quantized weights, which we don't seem to be taking advantage of?
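For context, here is a minimal sketch of what I'd expect such a container to look like and how it could be dequantized on the fly. The field names (`weight`, `scales`) and the `dequantize` helper are my assumptions for illustration, not necessarily how the repo actually defines or uses QuantizedWeight8bit:

```python
from dataclasses import dataclass
import jax.numpy as jnp

# Hypothetical container: int8 weights plus per-block/per-channel scales.
@dataclass
class QuantizedWeight8bit:
    weight: jnp.ndarray  # int8 values, e.g. shape (rows, cols)
    scales: jnp.ndarray  # float scales, broadcastable against `weight`

def dequantize(qw: QuantizedWeight8bit, dtype=jnp.bfloat16) -> jnp.ndarray:
    # Recover an approximate full-precision matrix: int8 values * scales.
    return qw.weight.astype(dtype) * qw.scales.astype(dtype)
```

If the import is only there so the checkpoint can be deserialized and then immediately expanded like this, peak memory would still look like FP16; if the forward pass kept the int8 weights and applied the scales per matmul instead, the footprint should be roughly half. Which of the two is the intended path?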