David Corvoysier

42 comments by David Corvoysier

I think a helper taking a model and a quantized state_dict as parameters and returning the quantized model might be a good idea.

OK, let me write an issue to explain a bit more what I expect.

Here you go: https://github.com/huggingface/quanto/issues/162.

The recommended way to save a quanto model is through a state_dict that can later be reloaded using `optimum.quanto.requantize`.

A paragraph could be added to the README showing, for instance, how to use `safetensors` to serialize the state_dict.

There is some code in the benchmark section that tracks device memory: https://github.com/huggingface/quanto/blob/main/bench/generation/metrics/latency.py

```python
def get_device_memory(device):
    gc.collect()
    if device.type == "cuda":
        torch.cuda.empty_cache()
        return torch.cuda.memory_allocated()
    elif device.type == "mps":
        torch.mps.empty_cache()
        return ...
```

> This is odd though, since in Task Manager, GPU VRAM is the same in both cases.

The measurement provided by pytorch is the one you should trust.

> Also, ...

I see, my mistake: the numbers reported by pytorch are for the quantized weights only, without any activations. When you pass images through the model, large activation buffers are also allocated, ...

quanto does not use very fancy CUDA kernels, so I don't see any reason why it wouldn't work. Just give it a try and please report your feedback.

I agree this is outdated. What actually happens for matrix multiplications is that the tensors are dequantized back to their original type, except if both tensors are int8. For int8...
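To illustrate the int8 path, here is a minimal pure-Python sketch (not quanto's actual kernel): with symmetric per-tensor quantization, the matmul can be accumulated in integers and the result rescaled by the product of the two scales.

```python
def quantize_sym(matrix, bits=8):
    """Symmetric per-tensor quantization: returns (int matrix, scale)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8
    amax = max(abs(v) for row in matrix for v in row) or 1.0
    scale = amax / qmax
    q = [[round(v / scale) for v in row] for row in matrix]
    return q, scale


def int_matmul(qa, qb):
    """Integer matmul; Python ints act as a wide accumulator."""
    rows, inner, cols = len(qa), len(qb), len(qb[0])
    return [
        [sum(qa[i][k] * qb[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def qmatmul(a, b):
    """Quantize both operands to int8, matmul in integers, rescale the result."""
    qa, sa = quantize_sym(a)
    qb, sb = quantize_sym(b)
    acc = int_matmul(qa, qb)
    return [[v * sa * sb for v in row] for row in acc]


a = [[0.5, -1.0], [2.0, 0.25]]
b = [[1.0, 0.0], [-0.5, 1.5]]
approx = qmatmul(a, b)
exact = [[1.0, -1.5], [1.875, 0.375]]  # plain float matmul of a and b
```

The rescaling step at the end is where the output leaves the int8 domain; real kernels fuse it with the accumulation for speed, but the arithmetic is the same.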