
Unable to load model

Picus303 opened this issue

(I'm an LLM beginner, so the problem is probably on my end, but since I found no documentation other than the README, I'm asking here.)

The bug: When trying to load a model with an example from the README, I only get an infinite loading time. Only one thread is used, nothing is loaded into RAM, and nothing happens on the GPU. The model: https://huggingface.co/decapoda-research/llama-7b-hf

It took me a few tries to find a model that works, but after a lot of error messages I think I have installed all the dependencies that aren't installed along with Guidance (PyTorch, TensorFlow, Transformers...).

To Reproduce

import guidance
guidance.llm = guidance.llms.Transformers(r"C:\path\to\models\llama-7b-hf", device=0)

System info:

  • OS: Windows 10
  • Guidance version: 0.0.61
  • Environment: Anaconda + JupyterLab
  • CPU: AMD Ryzen 5 5500
  • GPU: AMD Radeon RX 6800 XT

Picus303, Jun 05 '23

https://huggingface.co/decapoda-research/llama-7b-hf has a bug in its tokenizer_config and doesn't work with AutoTokenizer, but you can load LLaMA models (fine-tuned variants also work, though they usually need some extra prompt syntax) using:

import guidance
guidance.llm = guidance.llms.Transformers.LLaMA("decapoda-research/llama-7b-hf")

If nothing works, you can always load the model and tokenizer with transformers yourself and pass them to guidance.llms.Transformers:

import guidance
from transformers import AutoTokenizer, AutoModelForCausalLM

# "HuggingFace_repository" is a placeholder for the model's repo ID or local path
tokenizer = AutoTokenizer.from_pretrained("HuggingFace_repository")
model = AutoModelForCausalLM.from_pretrained("HuggingFace_repository")

guidance.llm = guidance.llms.Transformers(model=model, tokenizer=tokenizer)

LLaMA can take some time to load (usually a few minutes) and even longer to run inference. Load the model onto the GPU if possible, or load a quantized version of the model (I think 🤗 accelerate provides the functionality to load HF models in 4-bit and 8-bit versions out of the box).
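For example, a minimal sketch of the quantized route (assuming the bitsandbytes package is installed; load_in_8bit and device_map are standard transformers arguments, not anything guidance-specific):

import guidance
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("HuggingFace_repository")

# load_in_8bit quantizes the weights via bitsandbytes; device_map="auto"
# lets accelerate place the model across the available GPU(s) and CPU
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFace_repository",
    device_map="auto",
    load_in_8bit=True,
)

guidance.llm = guidance.llms.Transformers(model=model, tokenizer=tokenizer)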

AbdullahWasTaken, Jun 06 '23

In the new release we also support llama.cpp, which loads models much faster :)
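A minimal sketch against the newer API (the model path is a placeholder, and I'm assuming the models.LlamaCpp entry point from the new release; check the current README for the exact usage):

import guidance

# load a local GGUF model through llama.cpp; extra keyword arguments
# are passed through to the underlying llama-cpp-python loader
llm = guidance.models.LlamaCpp(r"C:\path\to\models\llama-7b.gguf")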

marcotcr, Nov 14 '23