Unable to load model
(I'm an LLM beginner, so I'm probably the issue, but since I found no documentation except the README, I'm asking here.)
The bug
When trying to load a model with an example from the README, I only get an infinite loading time: only one thread is used, nothing is loaded into RAM, and nothing happens on the GPU. The model: https://huggingface.co/decapoda-research/llama-7b-hf
It took me a few tries to find a model that works, but after a lot of error messages I think I have installed all the dependencies that aren't installed along with Guidance (PyTorch, TensorFlow, Transformers...).
To Reproduce
import guidance
guidance.llm = guidance.llms.Transformers(r"C:\path\to\models\llama-7b-hf", device=0)
System info:
- OS: Windows 10
- Guidance version: 0.0.61
- Anaconda + Jupyter lab
- AMD Ryzen 5 5500
- AMD Radeon RX 6800 XT
https://huggingface.co/decapoda-research/llama-7b-hf has a bug in its tokenizer config and doesn't work with AutoTokenizer, but you can load Llama models (fine-tuned variants also work, though they usually expect some extra prompt syntax) using
import guidance
guidance.llm = guidance.llms.Transformers.LLaMA("decapoda-research/llama-7b-hf")
If nothing else works, you can always load the model and tokenizer yourself with transformers and pass them to guidance.llms.Transformers:
import guidance
from transformers import AutoTokenizer, AutoModelForCausalLM

# "HuggingFace_repository" is a placeholder: use a Hub repo id or a local checkpoint path.
tokenizer = AutoTokenizer.from_pretrained("HuggingFace_repository")
model = AutoModelForCausalLM.from_pretrained("HuggingFace_repository")

# Hand the preloaded model and tokenizer to guidance.
guidance.llm = guidance.llms.Transformers(model=model, tokenizer=tokenizer)
Llama can take some time to load (usually a few minutes) and even longer to run inference. Load the model onto the GPU if possible, or load a quantized version of the model (I believe 🤗 transformers, with accelerate and bitsandbytes installed, can load HF models in 8-bit and 4-bit out of the box).
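As a rough sketch (assuming a recent transformers release with accelerate and bitsandbytes installed; the exact flag names such as load_in_8bit have changed across versions, and "HuggingFace_repository" is again a placeholder), loading the model on GPU and/or quantized before handing it to guidance could look like this:

import guidance
from transformers import AutoTokenizer, AutoModelForCausalLM

# Placeholder: replace with your model's Hub repo id or local path.
model_id = "HuggingFace_repository"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" (needs accelerate) places the weights on whatever GPU/CPU memory is available;
# load_in_8bit=True (needs bitsandbytes) loads a quantized copy, roughly halving memory vs fp16.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    load_in_8bit=True,
)

guidance.llm = guidance.llms.Transformers(model=model, tokenizer=tokenizer)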