You can test with this gist: https://gist.github.com/mobicham/701dd564c52590203ee09631425ad797
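For reference, the flow being tested looks roughly like the following. This is a minimal sketch, not the gist itself: the model id and the `nbits`/`group_size` values are illustrative assumptions, though `HqqConfig` is the real transformers entry point for HQQ quantization.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative model id, not the one from the gist

# 4-bit HQQ on-the-fly quantization (typical values, assumed here)
quant_config = HqqConfig(nbits=4, group_size=64)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quant_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```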
@ArthurZucker just a friendly reminder to review this PR when you have a moment. Let me know if you need any clarifications or if there’s anything I can help with....
@rohit-gupta thanks for flagging!
@blap is this related to the latest transformers changes? Otherwise, which hqq version causes this?
> > @blap is this related to the latest transformers changes? Otherwise, which hqq version causes this?
>
> I think so. I didn't have this problem in the release...
Can anyone from the HF team track down this problem, please? What changed? Nothing much changed on the hqq lib side.
@blap why don't you use the latest release? It worked fine last time I tried (last week).
@blap `4.47.0` works for sure
Any timeline for this? We would love to push a quantized version!
@naiveen what are you trying to optimize exactly? In practice, you need torch.compile / CUDA graphs end-to-end in your model to optimize inference, because there's overhead to launch the Triton...
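For context, the end-to-end compilation being referred to follows the standard transformers recipe, sketched below. It assumes `model` and `inputs` are set up as in the earlier sketch; the static KV cache gives fixed tensor shapes so the decode step can be captured as a single graph.

```python
import torch

# A static KV cache keeps tensor shapes fixed, which lets torch.compile
# capture the decode step as one CUDA graph via "reduce-overhead" mode.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

# The first calls are slow (compilation). Subsequent decode steps avoid the
# per-kernel launch overhead that dominates when running many small Triton kernels.
out = model.generate(**inputs, max_new_tokens=32)
```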