Daniel Han

Results: 781 comments of Daniel Han

Oh, for inference on CPU only, please use transformers directly - sadly we don't support CPU inference

Yes, use llama.cpp / GGUF for CPU inference
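
For reference, a rough sketch of exporting a fine-tuned Unsloth model to GGUF for CPU use - the `lora_model` path, output folder, and `q4_k_m` quant choice are just illustrative:

```python
from unsloth import FastLanguageModel

# Reload the fine-tuned adapter (path is illustrative)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# Export GGUF weights that llama.cpp can then run on CPU
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
```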

Another option is to run inference on the CPU with native transformers:

```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

model = AutoPeftModelForCausalLM.from_pretrained(
    "lora_model", # YOUR MODEL YOU...
```
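
The snippet above is cut off; a fuller sketch of the same approach, assuming the adapter was saved to `lora_model` (the prompt and `device_map` setting are illustrative):

```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load the LoRA adapter on CPU
model = AutoPeftModelForCausalLM.from_pretrained(
    "lora_model",  # path to your saved LoRA adapter
    device_map = "cpu",
)
tokenizer = AutoTokenizer.from_pretrained("lora_model")

inputs = tokenizer("Continue the story: once upon a time", return_tensors = "pt")
outputs = model.generate(**inputs, max_new_tokens = 64)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))
```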

Oh interesting, I'll check this and get back to you - sorry!

Apologies, I'll escalate this to a higher priority - will try to get a fix for this

Hmmm weird - I'll check this, sorry about the issue

Oh, you cannot use 4-bit models - you must use `model.save_pretrained_merged` to merge to 16-bit, then use vLLM
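
In case it helps, a rough sketch of that flow - the folder names, prompt, and sampling settings are illustrative, and the merge assumes `model` / `tokenizer` come from Unsloth:

```python
from unsloth import FastLanguageModel
from vllm import LLM, SamplingParams

# Reload the fine-tuned 4-bit adapter, then merge it into 16-bit weights
model, tokenizer = FastLanguageModel.from_pretrained("lora_model", load_in_4bit = True)
model.save_pretrained_merged("merged_16bit", tokenizer, save_method = "merged_16bit")

# vLLM loads the merged 16-bit folder, not the 4-bit checkpoint
llm = LLM(model = "merged_16bit")
outputs = llm.generate(["Hello!"], SamplingParams(max_tokens = 32))
print(outputs[0].outputs[0].text)
```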

A good idea to use llama.cpp's Python module (llama-cpp-python) - I'll make an example
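
Until then, a minimal sketch with llama-cpp-python, assuming a GGUF was already exported (the file name and prompt are illustrative):

```python
from llama_cpp import Llama

# Load the exported GGUF on CPU
llm = Llama(model_path = "model/unsloth.Q4_K_M.gguf", n_ctx = 2048)

output = llm("Q: What is the capital of France? A:", max_tokens = 32, stop = ["\n"])
print(output["choices"][0]["text"])
```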

Is this via Colab, Kaggle, or a local machine?