mobicham
You can select the GPUs you want to use via `CUDA_VISIBLE_DEVICES=0 ipython3`. What model and GPUs are you trying to use? If you want to use a multi-GPU runtime, I...
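For reference, a minimal sketch of doing the same thing from inside Python; the variable has to be set before `torch` initializes CUDA:
```Python
# Minimal sketch: restrict which GPUs are visible to the process.
# CUDA_VISIBLE_DEVICES must be set before torch touches the CUDA runtime.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only GPU 0

import torch
print(torch.cuda.device_count())  # reports 1 with the setting above
```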
Can you please share a code snippet of the model you are trying to use and your system settings (what GPUs does your machine have?)
Strange, try this:
```Python
import torch
from transformers import AutoTokenizer
from hqq.models.hf.base import AutoHQQHFModel
from hqq.utils.patching import *
from hqq.core.quantize import *
from hqq.utils.generation_hf import HFGenerator

#Load the model
###################################################...
```
Thanks! It should be similar to `.cuda()` but would use `.to('cpu')` instead: https://github.com/mobiusml/hqq/blob/b1a7c0698b2c323bfa55a2b4a110c8f3636fade7/hqq/core/quantize.py#L472-L535 Right now it is a mess because we support quantizing the scale/zero values and support offloading them...
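To illustrate the idea only (a rough sketch; the attribute names below are hypothetical and not hqq's actual internals): the CPU path would mirror the `.cuda()` traversal, just calling `.to('cpu')` on the packed weights and the scale/zero tensors.
```Python
import torch

# Rough sketch only: `W_q`, `scale`, `zero` are illustrative attributes,
# not hqq's actual layer internals.
class OffloadableLayerSketch:
    def __init__(self, W_q, scale, zero):
        self.W_q, self.scale, self.zero = W_q, scale, zero

    def cuda(self):
        # Move the packed weights and quantization meta-data to the GPU.
        self.W_q, self.scale, self.zero = (t.cuda() for t in (self.W_q, self.scale, self.zero))
        return self

    def to_cpu(self):
        # Same traversal as .cuda(), just targeting the CPU device instead.
        self.W_q, self.scale, self.zero = (t.to('cpu') for t in (self.W_q, self.scale, self.zero))
        return self
```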
This is an old issue, already resolved.
Thanks a lot for the effort @fahadh4ilyas! That is correct; as a temporary solution, there's this patching function that adds a quant_config: https://github.com/mobiusml/hqq/blob/master/hqq/utils/patching.py#L29 There's an easy way to do...
Yeah I thought about it, but it will make things even more complicated, since it will require more work on the `transformers` lib side. Putting everything in `state_dict` simplifies the...
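As a conceptual sketch of why this simplifies things (not hqq's actual serialization code): if the packed weights and the quantization meta-data are registered as buffers, they land in `state_dict()` automatically and travel through the standard save/load path with no extra plumbing on the `transformers` side.
```Python
import torch
import torch.nn as nn

class QuantizedLinearSketch(nn.Module):
    # Conceptual example only, not hqq's real layer.
    def __init__(self, W_q, scale, zero):
        super().__init__()
        # register_buffer puts these tensors into state_dict() automatically.
        self.register_buffer("W_q", W_q)
        self.register_buffer("scale", scale)
        self.register_buffer("zero", zero)

layer = QuantizedLinearSketch(torch.zeros(8, 8, dtype=torch.uint8),
                              torch.ones(8, 1), torch.zeros(8, 1))
torch.save(layer.state_dict(), "layer.pt")   # everything travels together
layer.load_state_dict(torch.load("layer.pt"))
```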
hqq's `save_quantized` wouldn't require changes in transformers, that's correct, but the goal is to have official serialization support in HF transformers directly, so we would be able to save models...
I also tried loading a model saved with the previous version (https://huggingface.co/mobiuslabsgmbh/Llama-2-7b-chat-hf_4bitnogs_hqq) and it worked without any issue, which is good news for backward compatibility. Now we just need to...
Draft pull request here: https://github.com/huggingface/transformers/pull/32056