starcoder
Has anyone run this successfully with CPU only, offline?
I tried to run the model with a CPU-only Python driver script, but unfortunately all of my attempts failed. Here is my adapted file:
Attempt 1:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
checkpoint = "bigcode/starcoder"
device = "cpu" # for GPU usage or "cpu" for CPU usage
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto").to(device)
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
But I got the following error message:
ValueError: The current `device_map` had weights offloaded to the disk. Please
provide an `offload_folder` for them. Alternatively, make sure you have `safetensors`
installed if the model you are using offers the weights in this format.
Attempt 2:
I also tried the Hugging Face quantization approach, referencing https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, AutoConfig
quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)
device_map = {
    "transformer.word_embeddings": 0,
    "transformer.word_embeddings_layernorm": 0,
    "lm_head": "cpu",
    "transformer.h": 0,
    "transformer.ln_f": 0,
}
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
model_8bit = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder",
    device_map=device_map,
    quantization_config=quantization_config,
)
print(f"Memory footprint: {model_8bit.get_memory_footprint() / 1e6:.2f} MB")
But I got another error:
ValueError: transformer.wte.weight doesn't have any device set.
Some system environment info:
Ubuntu 18
Python 3.8.5
torch 1.10.1+cu111
So I'm not sure whether a GPU is required for inference, or how to properly configure the device_map for this model. I'd appreciate help from anyone who knows about this. Thanks.
I'm no expert here, but the error pretty much sounds like it's trying to offload to some sort of virtual memory. You didn't say how much RAM (or what CPU) you have. "In FP32 the model requires more than 60GB of RAM, you can load it in FP16 or BF16 in ~30GB, or in 8bit under 20GB of RAM..." So even if you have the typical max of 64GB RAM on consumer hardware, it might not be enough. My 2c, G.
Hi @ai-bits, thank you for your suggestion. My CPU RAM is 48GB, but I have no idea how to set up a lower-precision load, and 8-bit doesn't seem to work in a CPU-only scenario. I hope there are some specific guidelines or examples for cases like this.
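For what it's worth, a lower-precision CPU load based on the model card numbers quoted above would look roughly like the minimal sketch below. This is untested on this setup; whether bfloat16 or float16 inference works on CPU depends on your torch build, so treat the dtype choice as an assumption.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Load the weights in bfloat16 (~30 GB) instead of the default fp32 (~60 GB).
# low_cpu_mem_usage avoids materializing a second full copy of the weights in RAM.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    torch_dtype=torch.bfloat16,  # assumption: your CPU/torch build supports bf16; torch.float16 is the alternative
    low_cpu_mem_usage=True,
)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))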
You may want to read this blog post to understand how to run large models with the help of accelerate. You can load the model with the following code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map="auto", offload_folder="offload", offload_state_dict=True, torch_dtype=torch.float32
)
The offload_folder argument will make use of the disk if you don't have enough GPU and CPU RAM.
inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
Your output should look like
def print_hello_world():
print("Hello World")
def print_hello_
Hi @ArmelRandy, I think this is working on my hardware, but it takes a long time to load the model, so I'm going to look into adding more GPU resources or using some other acceleration tool to speed it up. Many thanks for your comments.
The code above, I believe, uses device_map="auto"; change "auto" to {"": "cpu"}.
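For example, the earlier loading snippet with that one change would look roughly like the sketch below (untested here; every module is pinned to the CPU, so nothing is offloaded to disk and the full-precision weights must fit in RAM):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# device_map={"": "cpu"} pins every module to the CPU instead of letting
# accelerate decide placement (and possibly offload weights to disk) with "auto".
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map={"": "cpu"},
    torch_dtype=torch.float32,
)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))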