Daniel Han
Oh, for inference on CPU only, please use transformers directly - sadly we don't support CPU-only inference.
Yes - use llama.cpp / GGUF for CPU inference.
Another option is to run inference on the CPU with native transformers:
```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model = AutoPeftModelForCausalLM.from_pretrained(
    "lora_model",  # YOUR MODEL YOU USED FOR TRAINING
)
tokenizer = AutoTokenizer.from_pretrained("lora_model")
```
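As a rough sketch of what generation then looks like (assuming the LoRA adapters and tokenizer were saved to "lora_model", and using a placeholder prompt):
```python
# Model loads on CPU by default when no device_map is given.
inputs = tokenizer("Write a poem about llamas", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```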
Oh interesting I'll check this and get back to you - sorry!
Apologies - I'll escalate this to higher priority and try to get a fix for this.
Hmm, weird - I'll check this, sorry about the issue.
Let me re-prioritize this!
Oh, you cannot use the 4bit models directly - you must first merge and save to 16bit with `model.save_pretrained_merged`, then load that with vLLM.
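Something like the following, as a rough sketch (the folder name "model_16bit" is just a placeholder, and `model` / `tokenizer` are assumed to be the trained Unsloth model and its tokenizer):
```python
# Merge the LoRA weights into a 16bit checkpoint, then serve it with vLLM.
model.save_pretrained_merged("model_16bit", tokenizer, save_method = "merged_16bit")

from vllm import LLM, SamplingParams

llm = LLM(model = "model_16bit")
outputs = llm.generate(["Write a poem about llamas"], SamplingParams(max_tokens = 64))
print(outputs[0].outputs[0].text)
```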
A good idea to use llama.cpp's Python module (llama-cpp-python) - I'll make an example.
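A minimal sketch of what that could look like, assuming the model has already been exported to GGUF (the file name "model-unsloth.Q4_K_M.gguf" is just a placeholder):
```python
from llama_cpp import Llama

# Load the quantized GGUF model on CPU and run a single completion.
llm = Llama(model_path = "model-unsloth.Q4_K_M.gguf", n_ctx = 2048)
output = llm("Write a poem about llamas", max_tokens = 64)
print(output["choices"][0]["text"])
```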
Is this via Colab or Kaggle or local machines?