mobicham
Oh, in the review, you don't see this https://github.com/mobiusml/hqq/pull/116/files/631ea011d8432b8a76518b0adc072574969d8771 ?
I just tried this one and it compiles without graph breaks:

```Python
@torch.inference_mode()
def optimize_weights_proximal_legacy(
    tensor: Tensor,
    scale: Tensor,
    zero: Tensor,
    min_max: list,
    axis: int = 0,
    device: Union[str, None]...
```
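A quick way to double-check the "no graph breaks" claim is to compile with `fullgraph=True`, which raises on the first break instead of silently splitting the graph. A minimal smoke-test sketch; the input shapes and `min_max` values are placeholders, not values from the PR:

```Python
import torch

# fullgraph=True turns any graph break into an error, so a clean run
# confirms the function compiles as a single graph.
compiled = torch.compile(optimize_weights_proximal_legacy, fullgraph=True)

# Placeholder inputs purely for the smoke test.
W = torch.randn(64, 64, device="cuda")
scale = torch.ones(64, 1, device="cuda")
zero = torch.zeros(64, 1, device="cuda")
_ = compiled(W, scale, zero, min_max=[0.0, 15.0], axis=0)
```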
Works with `Quantizer.quantize` compiled as well! I suggest we do the following:

- We remove the `@torch.compile` decorator and do the compilation outside, either like this:

```Python
Quantizer.optimize_weights = torch.compile(Quantizer.optimize_weights)
```
...
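As a sketch of that opt-in compilation at application startup (the `hqq.core.quantize` import path is my assumption of where `Quantizer` lives; adjust as needed):

```Python
import torch
from hqq.core.quantize import Quantizer  # assumed import path

# Compile the optimizer once in user code instead of baking a
# @torch.compile decorator into the library source.
if hasattr(torch, "compile"):  # requires torch >= 2.0
    Quantizer.optimize_weights = torch.compile(Quantizer.optimize_weights)
```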
Closing this PR since there has been no update for over a year, but please feel free to open a new one!
@void-main Trying your code above, but there are a couple of issues:

* `cached_bin` doesn't have a `c_wrapper()` call
* `bin` doesn't have attributes like `num_ctas`, `clusterDims`, etc.

Do you have...
So I have been looking into this issue, and I can confirm that @xinji1's solution does work to some extent. However, calling `run()` directly is not compatible with `torch.compile`. Instead, I found...
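Not necessarily the approach referenced above, but one common way to keep a hand-written kernel launch from breaking `torch.compile` is to register it as a custom op (PyTorch >= 2.4), so Dynamo treats the call as opaque. A minimal sketch with a stand-in eager body where the real Triton launch would go; the op name and signature are hypothetical:

```Python
import torch
from torch import Tensor

# Registering the launch as a custom op lets torch.compile treat it as an
# opaque call instead of tracing into the Triton launcher internals.
@torch.library.custom_op("hqq_demo::dequant", mutates_args=())
def dequant(w_q: Tensor, scale: Tensor, zero: Tensor) -> Tensor:
    # Stand-in eager body; a real implementation would launch the kernel here.
    return (w_q.float() - zero) * scale

# Shape/dtype propagation so the op can be traced without running the kernel.
@dequant.register_fake
def _(w_q: Tensor, scale: Tensor, zero: Tensor) -> Tensor:
    return torch.empty_like(w_q, dtype=scale.dtype)
```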
```
pip install --upgrade git+https://github.com/huggingface/transformers.git
pip install --upgrade git+https://github.com/mobiusml/hqq.git
```
Hi! What kind of quantization is GGUF using? If it's asymmetric quantization (with both scales/zeros), it could be converted.
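For reference, "asymmetric" here means an affine scheme that stores both a scale and a zero-point, so dequantization is `(W_q - zero) * scale`. A minimal per-tensor sketch (HQQ actually quantizes per group, and GGUF block layouts differ, so this only illustrates the arithmetic):

```Python
import torch

def quantize_asym(w: torch.Tensor, n_bits: int = 4):
    # Affine (asymmetric) quantization: store both a scale and a zero-point.
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / (2**n_bits - 1)
    zero = torch.round(-w_min / scale)
    w_q = torch.clamp(torch.round(w / scale + zero), 0, 2**n_bits - 1)
    return w_q, scale, zero

def dequantize_asym(w_q, scale, zero):
    # Any format that stores both parameters can be remapped to this form.
    return (w_q - zero) * scale
```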
Thanks for sharing! It looks like the logic is quite different, so unfortunately I don't think the two quantized outputs are compatible.
It seems this is more of a transformers issue: it's not an official transformers model (it requires `trust_remote_code=True`), so it's difficult to make sure everything would work fine. The model is actually...
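For anyone hitting this, loading such a model means explicitly opting in to its custom modeling code; the model id below is a placeholder:

```Python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True executes the model's own modeling code from the Hub,
# which transformers (and hqq) cannot fully vet or test against.
model = AutoModelForCausalLM.from_pretrained("org/custom-model", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("org/custom-model", trust_remote_code=True)
```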