David Corvoysier

42 comments by David Corvoysier

I think a helper taking a model and a quantized state_dict as parameters and returning the quantized model might be a good idea.

OK, let me write an issue to explain a bit more what I expect.

Here you go: https://github.com/huggingface/quanto/issues/162.

The recommended way to save a quanto model is through a state_dict that can later be reloaded using `optimum.quanto.requantize`.

A paragraph could be added to the README showing, for instance, how to use `safetensors` to serialize the state_dict.

There is some code in the benchmark section that tracks device memory: https://github.com/huggingface/quanto/blob/main/bench/generation/metrics/latency.py

```python
def get_device_memory(device):
    gc.collect()
    if device.type == "cuda":
        torch.cuda.empty_cache()
        return torch.cuda.memory_allocated()
    elif device.type == "mps":
        torch.mps.empty_cache()
        return ...
```

> This is odd though, since in Task Manager, GPU VRAM is the same in both cases.

The measurement provided by pytorch is the one you should trust.

> Also, ...

I see, my mistake: the numbers reported by pytorch are for the quantized weights only, without any activations. When you pass images through the model, large activation buffers are also allocated, ...

quanto does not use very fancy CUDA kernels, so I don't see any reason why it wouldn't work. Just give it a try and please report your feedback.

I agree this is outdated. What actually happens for matrix multiplications is that the tensors are dequantized back to their original type, except if both tensors are int8. For int8...
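To illustrate the int8 path, here is a minimal pure-Python sketch (not quanto's actual kernel): with symmetric per-tensor quantization, the matmul can be accumulated in integers and the result rescaled by the product of the two scales.

```python
def quantize_sym(matrix, bits=8):
    """Symmetric per-tensor quantization: returns (int matrix, scale)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8
    amax = max(abs(v) for row in matrix for v in row) or 1.0
    scale = amax / qmax
    q = [[round(v / scale) for v in row] for row in matrix]
    return q, scale


def int_matmul(qa, qb):
    """Integer matmul; Python ints act as a wide accumulator."""
    rows, inner, cols = len(qa), len(qb), len(qb[0])
    return [
        [sum(qa[i][k] * qb[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def qmatmul(a, b):
    """Quantize both operands to int8, matmul in integers, rescale the result."""
    qa, sa = quantize_sym(a)
    qb, sb = quantize_sym(b)
    acc = int_matmul(qa, qb)
    return [[v * sa * sb for v in row] for row in acc]


a = [[0.5, -1.0], [2.0, 0.25]]
b = [[1.0, 0.0], [-0.5, 1.5]]
approx = qmatmul(a, b)
exact = [[1.0, -1.5], [1.875, 0.375]]  # plain float matmul of a and b
```

The rescaling step at the end is where the output leaves the int8 domain; real kernels fuse it with the accumulation for speed, but the arithmetic is the same.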