
Does anyone run successfully with CPU only offline?

Open ztxjack opened this issue 1 year ago • 5 comments

I tried to run the model with a CPU-only Python driver script, but every attempt so far has failed. Here is my adapted file:

Attempt 1:

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

checkpoint = "bigcode/starcoder"
device = "cpu" # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto").to(device)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)

outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

But I got the following error message:

ValueError: The current `device_map` had weights offloaded to the disk. Please 
provide an `offload_folder` for them. Alternatively, make sure you have `safetensors`
installed if the model you are using offers the weights in this format.

Attempt 2:

I also tried the Hugging Face quantization approach, following https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, AutoConfig

quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)

device_map = {
    "transformer.word_embeddings": 0,
    "transformer.word_embeddings_layernorm": 0,
    "lm_head": "cpu",
    "transformer.h": 0,
    "transformer.ln_f": 0,
}
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")

model_8bit = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder",
    device_map=device_map,
    quantization_config=quantization_config,
)

print(f"Memory footprint: {model_8bit.get_memory_footprint() / 1e6:.2f} MB")

But this produced another error:

ValueError: transformer.wte.weight doesn't have any device set.
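
For what it's worth, the device_map keys above come from the Bloom example in the linked docs; StarCoder's GPTBigCode architecture exposes different module names, which is presumably why transformer.wte.weight ends up without a device. A hedged sketch of a map keyed on the names the error hints at (verify against model.named_modules() on your own install) might look like the following; note that bitsandbytes 8-bit layers still require a CUDA GPU, so on a CPU-only box the quantization_config would likely have to be dropped:

# Hypothetical sketch: keys follow StarCoder's (GPTBigCode) module names
# rather than the Bloom names used in the docs example.
device_map = {
    "transformer.wte": "cpu",   # token embeddings (the weight the error names)
    "transformer.wpe": "cpu",   # position embeddings
    "transformer.h": "cpu",     # the stack of transformer blocks
    "transformer.ln_f": "cpu",  # final layer norm
    "lm_head": "cpu",
}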

Some system env info:

Ubuntu 18
Python 3.8.5
torch 1.10.1 (CUDA 11.1 build)

So I'm not sure whether a GPU is required for inference, or how to properly configure the device_map for this model. I hope someone who knows about this can help me. Thanks.

ztxjack avatar May 29 '23 12:05 ztxjack

I'm no expert here, but the error pretty much sounds like it's trying to offload to some sort of virtual memory. You didn't say how much RAM (or what CPU) you have. "In FP32 the model requires more than 60GB of RAM, you can load it in FP16 or BF16 in ~30GB, or in 8bit under 20GB of RAM..." So even if you have the typical max of 64GB RAM on consumer hardware, it might not be enough. My 2c, G.
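
For the lower-precision route, a minimal sketch (assuming a transformers/torch combination that supports bfloat16 on the CPU) could be:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Keep the whole model on the CPU and load the weights in bfloat16,
# roughly halving the >60GB FP32 footprint quoted above (~30GB).
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map={"": "cpu"},
    torch_dtype=torch.bfloat16,
)

Older torch builds have patchy bf16 kernel coverage on the CPU, so if generation errors out or crawls, float32 plus disk offload (see ArmelRandy's reply below) is the fallback.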

ai-bits avatar May 29 '23 13:05 ai-bits

Hi @ai-bits, thank you for your suggestion. My CPU RAM is 48GB, but I have no idea how to set up a lower-precision load, and 8-bit doesn't seem to work in a CPU-only scenario. I hope there are some specific guidelines or examples for this kind of case.

ztxjack avatar May 30 '23 00:05 ztxjack

You may want to read this blog post to understand how to run large models with the help of accelerate. You can load the model with the following code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map="auto", offload_folder="offload", offload_state_dict=True, torch_dtype=torch.float32
)

The offload_folder argument will make use of the disk if you don't have enough GPU and CPU RAM.

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt")
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

Your output should look like

def print_hello_world():
    print("Hello World")

def print_hello_
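
The completion is cut off because generate() falls back to its default maximum length of 20 tokens; a longer completion can be requested explicitly, for example (reusing the model and tokenizer from above):

outputs = model.generate(inputs, max_new_tokens=64)  # default max_length is only 20 tokens
print(tokenizer.decode(outputs[0]))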

ArmelRandy avatar May 30 '23 15:05 ArmelRandy

Hi @ArmelRandy, I think this is working on my hardware, but it just takes a long time to load the model, so I'm going to look at adding more GPU resources or using some other acceleration tool to speed it up. Many thanks for your comments.

ztxjack avatar Jun 01 '23 03:06 ztxjack

The code in the original post, I believe, uses device_map="auto"; change "auto" to {"": "cpu"} to keep everything on the CPU.
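
A sketch of that change against the snippet from the original post (assuming the ~60GB of FP32 weights fit in CPU RAM; otherwise keep the offload_folder arguments from the earlier reply):

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# {"": "cpu"} maps the root module (and therefore the whole model) to the CPU,
# so accelerate never tries to offload anything to disk.
model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map={"": "cpu"})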

phalexo avatar Jun 10 '23 15:06 phalexo