How to save / form the config.json after fine-tuning - Flan T5 11b
After fine-tuning a Flan-T5 11B model on custom data, I was saving the checkpoint via Accelerate like this:
accelerator.wait_for_everyone()
accelerator.save(
get_peft_model_state_dict(model, state_dict=accelerator.get_state_dict(model)), checkpoint_name
)
accelerator.wait_for_everyone()
It didn't create the config.json needed to load the model. The checkpoint itself was created (cdcFT5_lora.pt, a ~19 MB file).
I am trying to create it manually for inference, using the parameters I used for training and looking at some sample LoRA model files. Should target_modules be
"target_modules": [ "q", "v" ],
OR
"target_modules": [ "query_key_value" ],
{
"base_model_name_or_path": "./cdcFT5_lora.pt",
"bias": "none",
"enable_lora": [
true,
false,
true
],
"fan_in_fan_out": true,
"inference_mode": true,
"lora_alpha": 32,
"lora_dropout": 0.1,
"merge_weights": false,
"modules_to_save": null,
"peft_type": "LORA",
"r": 8,
"target_modules": [
"q",
"v"
],
"task_type": "SEQ_2_SEQ_LM"
}
What values should I give for "enable_lora": [ true, false, true ] and "fan_in_fan_out": true?
For inference, should enable_lora be true and fan_in_fan_out be false?
How do I save the model with config.json directly as well?
Is it via
peft_model_id = f"{model_name_or_path}_{peft_config.peft_type}_{peft_config.task_type}"
accelerator.save_pretrained(peft_model_id)
I see model.save_pretrained() exists; I am not sure whether accelerator.save_pretrained(peft_model_id) works as well.
Any way to load the checkpoint and create the config file as well, without re-training?
I was able to re-create the config file by training on a smaller data set and then saving it using
finalmodel = accelerator.unwrap_model(model)
finalmodel.save_pretrained(peft_model_id)
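For reference, it should also be possible to re-create adapter_config.json from the existing cdcFT5_lora.pt checkpoint without retraining. A minimal sketch, assuming the original LoRA hyperparameters (r=8, lora_alpha=32, dropout 0.1, target_modules ["q", "v"]), that the checkpoint holds the state dict returned by get_peft_model_state_dict, and that set_peft_model_state_dict is available in the installed peft version:

import torch
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model
from peft.utils import set_peft_model_state_dict

# Rebuild the PEFT wrapper with the same LoRA config that was used for training.
base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl")
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q", "v"],  # T5 attention query/value projections
)
model = get_peft_model(base_model, lora_config)

# Load the adapter weights saved earlier with get_peft_model_state_dict().
adapter_state_dict = torch.load("cdcFT5_lora.pt", map_location="cpu")
set_peft_model_state_dict(model, adapter_state_dict)

# save_pretrained() writes adapter_config.json next to the adapter weights.
model.save_pretrained("cdcFT5lra")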
How can I do inference easily from a PeftModelForSeq2SeqLM model using Hugging Face pipelines, like this:
from transformers import pipeline

summarizer = pipeline("summarization", "cdcFT5lra", torch_dtype=torch.bfloat16)
raw_document = 'You must be 18 years old to live or work in New York State...'
prompt = "Summarize the following article in 10-20 words:"
results = summarizer(
    f"{prompt} \n\n {raw_document}",
    num_beams=5,
    min_length=5,
    no_repeat_ngram_size=3,
    truncation=True,
    max_length=512,
)
OR
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")

def generate_simple(input_text):
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
    output = model.generate(
        input_ids,
        max_length=1024,
        temperature=0.7)
    return tokenizer.decode(output[0], skip_special_tokens=True)

input_text = """Explain artificial intelligence"""
generate_simple(input_text)
doesn't work. It gives an error:
TypeError: generate() takes 1 positional argument but 2 were given
The PEFT examples use datasets as input for inference. Is that the only way?
Hello @sujithjoseph, for the PEFT generate method, one has to provide kwargs. Could you try the change below and let us know if that resolves the issue? Will add this point to the caveats.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")

def generate_simple(input_text):
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
    output = model.generate(
-       input_ids,
+       input_ids=input_ids,
        max_length=1024,
        temperature=0.7)
    return tokenizer.decode(output[0], skip_special_tokens=True)

input_text = """Explain artificial intelligence"""
generate_simple(input_text)
Also, you can use it with pipelines via the logic below, although a warning will be displayed saying the model might be unsupported; it can be ignored, because PeftModel isn't a subclass of model classes such as T5:
from transformers import SummarizationPipeline

summarizer = SummarizationPipeline(model=model, tokenizer=tokenizer)
raw_document = 'You must be 18 years old to live or work in New York State...'
prompt = "Summarize the following article in 10-20 words:"
results = summarizer(
    f"{prompt} \n\n {raw_document}",
    num_beams=5,
    min_length=5,
    no_repeat_ngram_size=3,
    truncation=True,
    max_length=512,
)
Let us know if above snippet helps in using pipeline
Thanks @pacman100, really appreciate it! I had a follow-up question. I was trying to load the model in int8:
max_memory={0: "30GIB", 1: "0GIB", 2: "0GIB", 3: "0GIB", 4: "0GIB", "cpu":"60GB"}
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, device_map="auto", max_memory=max_memory, load_in_8bit=True)
model = PeftModel.from_pretrained(model, peft_model_id, device_map="auto", max_memory=max_memory)
Got a runtime error: RuntimeError: expected scalar type Half but found Float
By default, does it load in bfloat16 or float16 if the model was trained in bfloat16?
The fine-tuned flan-t5-xxl takes around 10-20 seconds on a single 40 GB A100 GPU to answer a prompt. Is there anything that can be done to make it faster, without using a smaller flan-t5 model?
Try running in bf16 instead of fp32. Also, you can look at ONNX/TensorRT.
I had a follow-up question. I was trying to load the model in int8
To load a model trained using Accelerate + DeepSpeed ZeRO-3, you can do the following. Below is an example for a 3B model:
+ from peft import prepare_model_for_training

peft_model_id = "smangrul/twitter_complaints_bigscience_T0_3B_LORA_SEQ_2_SEQ_LM"
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path,
                                              load_in_8bit=True,
                                              device_map={'':0})
+ model = prepare_model_for_training(model)
model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
Then running generate as usual:
%%time
model.eval()
inputs = tokenizer(f'{text_column} : Iphone battery sucks 11.2.1 @Apple Label : ', return_tensors="pt")
print(dataset["test"][i]["Tweet text"])
print(inputs)

with torch.no_grad():
    outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=10)

print(outputs)
print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))
# ['complaint']
I ran the below snippet in a Jupyter cell for the following 3 settings:
from time import time

model.eval()
inputs = tokenizer(f'{text_column} : Iphone battery sucks 11.2.1 @Apple Label : ', return_tensors="pt")
print(inputs)

times = []  # in ms
for i in range(100):
    with torch.no_grad():
        # with torch.cuda.amp.autocast():
        start = time()
        outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=10)
        times.append((time() - start) * 1000)

print(outputs)
print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))
sum(times) / len(times)
- For fp32, load directly without using device_map if you have enough GPU memory: model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
- For bf16, after loading the PeftModel, do model.to(torch.bfloat16) (see the sketch below)
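For concreteness, a rough sketch of how those three settings can be set up (peft_model_id is a hypothetical path to the saved adapter; only the FP32 lines are left uncommented):

import torch
from transformers import AutoModelForSeq2SeqLM
from peft import PeftConfig, PeftModel

peft_model_id = "path/to/adapter"  # hypothetical adapter directory
config = PeftConfig.from_pretrained(peft_model_id)

# FP32: load the base model directly if it fits in GPU memory
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(model, peft_model_id).to("cuda")

# BF16: same as FP32, then cast after wrapping with PeftModel
# model = model.to(torch.bfloat16)

# INT8: quantize the base model weights with bitsandbytes
# model = AutoModelForSeq2SeqLM.from_pretrained(
#     config.base_model_name_or_path, load_in_8bit=True, device_map={"": 0}
# )
# model = PeftModel.from_pretrained(model, peft_model_id)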
precision | inference wall time (ms) |
---|---|
FP32 | 96 |
BF16 | 105 |
INT8 | 370 |
@mayank31398, BF16 taking more time than FP32 is peculiar; usually with FP16 models the latency is roughly halved, but here it is increasing. To make sure this isn't related to PEFT, I loaded just the pretrained LLM and can still see the same behaviour, with BF16 latency higher than FP32.
@sujithjoseph, device_map and load_in_8bit are used for low-resource inference, e.g. when your GPU's VRAM can't fit the entire model: device_map offloads parts of it to CPU or spreads it across smaller GPUs, while load_in_8bit aims to fit such large models on a given GPU by keeping weights in int8 precision.
For very low latencies, as @mayank31398 suggested, you would have to convert the model to ONNX/TensorRT; alternatively use flash attention, fused kernels ...
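A compact sketch of that low-resource loading pattern (the adapter path and max_memory budget here are illustrative, not from this thread):

from transformers import AutoModelForSeq2SeqLM
from peft import PeftConfig, PeftModel

peft_model_id = "path/to/adapter"  # hypothetical adapter directory
config = PeftConfig.from_pretrained(peft_model_id)

# device_map places layers across the listed devices and offloads the overflow to CPU;
# load_in_8bit keeps the linear-layer weights in int8 (requires bitsandbytes).
max_memory = {0: "30GiB", "cpu": "60GiB"}
model = AutoModelForSeq2SeqLM.from_pretrained(
    config.base_model_name_or_path,
    device_map="auto",
    max_memory=max_memory,
    load_in_8bit=True,
)
model = PeftModel.from_pretrained(model, peft_model_id, device_map="auto", max_memory=max_memory)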
Thanks a lot @pacman100 @mayank31398! This has been really insightful! I didn't know that converting the model to TensorRT and serving it via the TRT inference server would be faster than PEFT + DeepSpeed ZeRO-3 for inference.
I also see quality issues on the fine-tuned flan-t5-xxl (trained on 500K records), unlike the original model: it is hallucinating a lot. I had used a batch size of 1, as I couldn't fit training on 8x 40 GB A100s with a batch size of 2 (it would run for a couple of hours and then go OOM). Here are the train/eval ppl/loss:
epoch: 0
train_ppl: 133.7952117919922
train_epoch_loss: 4.896310329437256
eval_ppl: 1.5221441984176636
eval_epoch_loss: 0.4201200008392334
def generate_custom(input_text):
    input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
    output = model.generate(
        input_ids=input_ids,
        min_length=256,
        max_new_tokens=1024,
        length_penalty=1.4,
        no_repeat_ngram_size=2,
        top_k=150,
        top_p=0.92,
        repetition_penalty=2.1,
        # num_beams=4,
        temperature=0.7)
    return tokenizer.decode(output[0], skip_special_tokens=True)
8x 40G A100s should be enough for PEFT training of FLAN. Can you tell me what backend you are using? Are you not using DeepSpeed?
Yes, DeepSpeed ZeRO-3. It worked fine with a batch size of 1, not 2. I am concerned that the lower batch size is hurting model quality. I had 500K records as the training set. Here is my config (DeepSpeed / Accelerate):
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
  bf16:enabled: true
distributed_type: DEEPSPEED
downcast_bf16: true
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'bf16'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
use_cpu: false
reference - https://github.com/microsoft/DeepSpeed/issues/2820
I only see 4 processes in the yaml ^^ You can always enable CPU offloading.
@mayank31398 I had started with 4 and expanded to 8; my final config has num_processes as 8. Doesn't this enable CPU offloading?
offload_optimizer_device: cpu
offload_param_device: cpu
I also had changed this in the final config - dynamo_backend: 'INDUCTOR'
If I shard the xxl base model like this
model.save_pretrained("sharded", max_shard_size="2000MB")
will it help with fine-tuning at a larger batch size, or should I load it in int8 and fine-tune with whatever larger batch size fits in memory? I am not sure which approach will result in a higher-quality model.
Since I have the CUDA 11.6 driver installed (Vertex AI), I was using torch 1.12.1+cu116. During installation, I see this:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
peft 0.2.0.dev0 requires torch>=1.13.0, but you have torch 1.12.1+cu116 which is incompatible.
Does peft really need torch 1.13.0? So far, I haven't seen any issues with 1.12.1+cu116 and peft.
@pacman100, I am not able to import prepare_model_for_training from main. I did pip install -U git+https://github.com/huggingface/peft.git. Should I install this branch - https://github.com/huggingface/peft/tree/younesbelkada-flan-t5-xl ?
ImportError: cannot import name 'prepare_model_for_training' from 'peft' (/opt/conda/lib/python3.7/site-packages/peft/__init__.py). I see it in https://github.com/huggingface/peft/blob/main/src/peft/utils/other.py and in https://github.com/huggingface/peft/blob/main/src/peft/__init__.py as well. Probably need to uninstall and install again.
pip install --upgrade -e git+https://github.com/huggingface/peft.git#egg=peft
pip install --upgrade git+https://github.com/huggingface/peft.git
This helped to fix it.
from time import time

model.eval()
inputs = tokenizer(f'Explain Artificial Intelligence ', return_tensors="pt")
print(inputs)

times = []  # in ms
for i in range(100):
    with torch.no_grad():
        # with torch.cuda.amp.autocast():
        start = time()
        outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=10)
        times.append((time() - start) * 1000)

print(outputs)
print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))
sum(times) / len(times)
It gives the below error: AttributeError: 'NoneType' object has no attribute 'device'
─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module> │
│ │
│ 8 │ with torch.no_grad(): │
│ 9 │ │ #with torch.cuda.amp.autocast(): │
│ 10 │ │ start = time() │
│ ❱ 11 │ │ outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_token │
│ 12 │ │ times.append((time()-start)*1000) │
│ 13 print(outputs) │
│ 14 print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)) │
│ │
│ /opt/conda/lib/python3.7/site-packages/peft/peft_model.py:708 in generate │
│ │
│ 705 │ │
│ 706 │ def generate(self, **kwargs): │
│ 707 │ │ if not isinstance(self.peft_config, PromptLearningConfig): │
│ ❱ 708 │ │ │ return self.base_model.generate(**kwargs) │
│ 709 │ │ else: │
│ 710 │ │ │ if "input_ids" not in kwargs: │
│ 711 │ │ │ │ raise ValueError("input_ids must be provided for Peft model generation") │
│ │
│ /opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py:27 in decorate_context │
│ │
│ 24 │ │ @functools.wraps(func) │
│ 25 │ │ def decorate_context(*args, **kwargs): │
│ 26 │ │ │ with self.clone(): │
│ ❱ 27 │ │ │ │ return func(*args, **kwargs) │
│ 28 │ │ return cast(F, decorate_context) │
│ 29 │ │
│ 30 │ def _wrap_generator(self, func): │
│ │
│ /opt/conda/lib/python3.7/site-packages/transformers/generation/utils.py:1248 in generate │
│ │
│ 1245 │ │ │ # if model is encoder decoder encoder_outputs are created │
│ 1246 │ │ │ # and added to `model_kwargs` │
│ 1247 │ │ │ model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation( │
│ ❱ 1248 │ │ │ │ inputs_tensor, model_kwargs, model_input_name │
│ 1249 │ │ │ ) │
│ 1250 │ │ │
│ 1251 │ │ # 5. Prepare `input_ids` which will be used for auto-regressive generation │
│ │
│ /opt/conda/lib/python3.7/site-packages/transformers/generation/utils.py:609 in │
│ _prepare_encoder_decoder_kwargs_for_generation │
│ │
│ 606 │ │ model_input_name = model_input_name if model_input_name is not None else self.ma │
│ 607 │ │ encoder_kwargs["return_dict"] = True │
│ 608 │ │ encoder_kwargs[model_input_name] = inputs_tensor │
│ ❱ 609 │ │ model_kwargs["encoder_outputs"]: ModelOutput = encoder(**encoder_kwargs) │
│ 610 │ │ │
│ 611 │ │ return model_kwargs │
│ 612 │
│ │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1130 in _call_impl │
│ │
│ 1127 │ │ # this function, and just call forward. │
│ 1128 │ │ if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o │
│ 1129 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1130 │ │ │ return forward_call(*input, **kwargs) │
│ 1131 │ │ # Do not call functions when jit is used │
│ 1132 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1133 │ │ if self._backward_hooks or _global_backward_hooks: │
│ │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:1075 in forward │
│ │
│ 1072 │ │ │ │ │ cross_attn_layer_head_mask=cross_attn_layer_head_mask, │
│ 1073 │ │ │ │ │ past_key_value=past_key_value, │
│ 1074 │ │ │ │ │ use_cache=use_cache, │
│ ❱ 1075 │ │ │ │ │ output_attentions=output_attentions, │
│ 1076 │ │ │ │ ) │
│ 1077 │ │ │ │
│ 1078 │ │ │ # layer_outputs is a tuple with: │
│ │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1130 in _call_impl │
│ │
│ 1127 │ │ # this function, and just call forward. │
│ 1128 │ │ if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o │
│ 1129 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1130 │ │ │ return forward_call(*input, **kwargs) │
│ 1131 │ │ # Do not call functions when jit is used │
│ 1132 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1133 │ │ if self._backward_hooks or _global_backward_hooks: │
│ │
│ /opt/conda/lib/python3.7/site-packages/accelerate/hooks.py:158 in new_forward │
│ │
│ 155 │ │ │ with torch.no_grad(): │
│ 156 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 157 │ │ else: │
│ ❱ 158 │ │ │ output = old_forward(*args, **kwargs) │
│ 159 │ │ return module._hf_hook.post_forward(module, output) │
│ 160 │ │
│ 161 │ module.forward = new_forward │
│ │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:692 in forward │
│ │
│ 689 │ │ │ layer_head_mask=layer_head_mask, │
│ 690 │ │ │ past_key_value=self_attn_past_key_value, │
│ 691 │ │ │ use_cache=use_cache, │
│ ❱ 692 │ │ │ output_attentions=output_attentions, │
│ 693 │ │ ) │
│ 694 │ │ hidden_states, present_key_value_state = self_attention_outputs[:2] │
│ 695 │ │ attention_outputs = self_attention_outputs[2:] # Keep self-attention outputs an │
│ │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1130 in _call_impl │
│ │
│ 1127 │ │ # this function, and just call forward. │
│ 1128 │ │ if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o │
│ 1129 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1130 │ │ │ return forward_call(*input, **kwargs) │
│ 1131 │ │ # Do not call functions when jit is used │
│ 1132 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1133 │ │ if self._backward_hooks or _global_backward_hooks: │
│ │
│ /opt/conda/lib/python3.7/site-packages/accelerate/hooks.py:158 in new_forward │
│ │
│ 155 │ │ │ with torch.no_grad(): │
│ 156 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 157 │ │ else: │
│ ❱ 158 │ │ │ output = old_forward(*args, **kwargs) │
│ 159 │ │ return module._hf_hook.post_forward(module, output) │
│ 160 │ │
│ 161 │ module.forward = new_forward │
│ │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:599 in forward │
│ │
│ 596 │ │ │ layer_head_mask=layer_head_mask, │
│ 597 │ │ │ past_key_value=past_key_value, │
│ 598 │ │ │ use_cache=use_cache, │
│ ❱ 599 │ │ │ output_attentions=output_attentions, │
│ 600 │ │ ) │
│ 601 │ │ hidden_states = hidden_states + self.dropout(attention_output[0]) │
│ 602 │ │ outputs = (hidden_states,) + attention_output[1:] # add attentions if we output │
│ │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1130 in _call_impl │
│ │
│ 1127 │ │ # this function, and just call forward. │
│ 1128 │ │ if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o │
│ 1129 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1130 │ │ │ return forward_call(*input, **kwargs) │
│ 1131 │ │ # Do not call functions when jit is used │
│ 1132 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1133 │ │ if self._backward_hooks or _global_backward_hooks: │
│ │
│ /opt/conda/lib/python3.7/site-packages/accelerate/hooks.py:158 in new_forward │
│ │
│ 155 │ │ │ with torch.no_grad(): │
│ 156 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 157 │ │ else: │
│ ❱ 158 │ │ │ output = old_forward(*args, **kwargs) │
│ 159 │ │ return module._hf_hook.post_forward(module, output) │
│ 160 │ │
│ 161 │ module.forward = new_forward │
│ │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:511 in forward │
│ │
│ 508 │ │ │ return hidden_states │
│ 509 │ │ │
│ 510 │ │ # get query states │
│ ❱ 511 │ │ query_states = shape(self.q(hidden_states)) # (batch_size, n_heads, seq_length, │
│ 512 │ │ │
│ 513 │ │ # get key/value states │
│ 514 │ │ key_states = project( │
│ │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1130 in _call_impl │
│ │
│ 1127 │ │ # this function, and just call forward. │
│ 1128 │ │ if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o │
│ 1129 │ │ │ │ or _global_forward_hooks or _global_forward_pre_hooks): │
│ ❱ 1130 │ │ │ return forward_call(*input, **kwargs) │
│ 1131 │ │ # Do not call functions when jit is used │
│ 1132 │ │ full_backward_hooks, non_full_backward_hooks = [], [] │
│ 1133 │ │ if self._backward_hooks or _global_backward_hooks: │
│ │
│ /opt/conda/lib/python3.7/site-packages/accelerate/hooks.py:158 in new_forward │
│ │
│ 155 │ │ │ with torch.no_grad(): │
│ 156 │ │ │ │ output = old_forward(*args, **kwargs) │
│ 157 │ │ else: │
│ ❱ 158 │ │ │ output = old_forward(*args, **kwargs) │
│ 159 │ │ return module._hf_hook.post_forward(module, output) │
│ 160 │ │
│ 161 │ module.forward = new_forward │
│ │
│ /opt/conda/lib/python3.7/site-packages/peft/tuners/lora.py:456 in forward │
│ │
│ 453 │ │ │ │ nn.init.zeros_(self.lora_B.weight) │
│ 454 │ │ │
│ 455 │ │ def forward(self, x: torch.Tensor): │
│ ❱ 456 │ │ │ result = super().forward(x) │
│ 457 │ │ │ if self.r > 0: │
│ 458 │ │ │ │ result += self.lora_B(self.lora_A(self.lora_dropout(x))) * self.scaling │
│ 459 │ │ │ return result │
│ │
│ /opt/conda/lib/python3.7/site-packages/bitsandbytes/nn/modules.py:242 in forward │
│ │
│ 239 │ │ if self.bias is not None and self.bias.dtype != x.dtype: │
│ 240 │ │ │ self.bias.data = self.bias.data.to(x.dtype) │
│ 241 │ │ │
│ ❱ 242 │ │ out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state) │
│ 243 │ │ if not self.state.has_fp16_weights: │
│ 244 │ │ │ if self.state.CB is not None and self.state.CxB is not None: │
│ 245 │ │ │ │ # we converted 8-bit row major to turing/ampere format in the first infe │
│ │
│ /opt/conda/lib/python3.7/site-packages/bitsandbytes/autograd/_functions.py:488 in matmul │
│ │
│ 485 │ state = state or MatmulLtState() │
│ 486 │ if threshold > 0.0: │
│ 487 │ │ state.threshold = threshold │
│ ❱ 488 │ return MatMul8bitLt.apply(A, B, out, bias, state) │
│ 489 │
│ │
│ /opt/conda/lib/python3.7/site-packages/bitsandbytes/autograd/_functions.py:320 in forward │
│ │
│ 317 │ │ │ │ │ state.CxB, state.SB = F.transform(state.CB, to_order=formatB) │
│ 318 │ │ else: │
│ 319 │ │ │ if not state.has_fp16_weights and state.CxB is None and using_igemmlt: │
│ ❱ 320 │ │ │ │ state.CxB, state.SB = F.transform(state.CB, to_order=formatB) │
│ 321 │ │ │ subA = None │
│ 322 │ │ │
│ 323 │ │ # 2. Quantize B │
│ │
│ /opt/conda/lib/python3.7/site-packages/bitsandbytes/functional.py:1698 in transform │
│ │
│ 1695 │
│ 1696 │
│ 1697 def transform(A, to_order, from_order='row', out=None, transpose=False, state=None, ld=N │
│ ❱ 1698 │ prev_device = pre_call(A.device) │
│ 1699 │ if state is None: state = (A.shape, from_order) │
│ 1700 │ else: from_order = state[1] │
│ 1701 │ if out is None: out, new_state = get_transform_buffer(state[0], A.dtype, A.device, t │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'NoneType' object has no attribute 'device'
This only happens when I load the model in 8-bit alone.
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, device_map={'':0}, load_in_8bit=True,torch_dtype=torch.float16)
device = torch.device("cuda")
model.cuda()
model = prepare_model_for_training(model)
model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
either on 1 GPU or with device_map="auto".
@sujithjoseph, what is the DeepSpeed version being used? PEFT requires v0.8.0, as it resolved a bug related to training when a lot of params are frozen.
This only happens when I load the model in 8-bit alone.
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, device_map={'':0}, load_in_8bit=True, torch_dtype=torch.float16)
device = torch.device("cuda")
model.cuda()
model = prepare_model_for_training(model)
model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)
either on 1 GPU or with device_map="auto".
Does adding device_map={'':0} to PeftModel.from_pretrained resolve the issue?
model = PeftModel.from_pretrained(model, peft_model_id, device_map={'':0})
Also, may I know what is the input and output seq lengths of the dataset?
In my experiment on a summarization task using PEFT+DS_Z3 with 4 A100 GPUs,
- Input seq length = 255
- output seq length = 50
- batch_size_per_gpu = 8 (so total batch size of 32=8*4)
Code: https://gist.github.com/pacman100/c07b7c5b279543d0c1d164bf9c03967b
I observe below memory stats:
GPU Memory before entering the train : 954
GPU Memory consumed at the end of the train (end-begin): 1020
GPU Peak Memory consumed during the train (max-begin): 31323
GPU Total Peak Memory consumed during the train (max): 32277
CPU Memory before entering the train : 7361
CPU Memory consumed at the end of the train (end-begin): 1034
CPU Peak Memory consumed during the train (max-begin): 1034
CPU Total Peak Memory consumed during the train (max): 8395
So, it works fine with a decent batch size. However, if input and output sequence lengths are very large, it might cause OOM, as activations from the intermediate layers become the bottleneck.
@sujithjoseph, what is the DeepSpeed version being used? PEFT requires v0.8.0, as it resolved a bug related to training when a lot of params are frozen.
@pacman100 deepspeed==0.8.0
Also, may I know what is the input and output seq lengths of the dataset?
In my experiment on a summarization task using PEFT+DS_Z3 with 4 A100 GPUs,
- Input seq length = 255
- output seq length = 50
- batch_size_per_gpu = 8 (so total batch size of 32=8*4)
Code: https://gist.github.com/pacman100/c07b7c5b279543d0c1d164bf9c03967b
I observe below memory stats:
GPU Memory before entering the train : 954
GPU Memory consumed at the end of the train (end-begin): 1020
GPU Peak Memory consumed during the train (max-begin): 31323
GPU Total Peak Memory consumed during the train (max): 32277
CPU Memory before entering the train : 7361
CPU Memory consumed at the end of the train (end-begin): 1034
CPU Peak Memory consumed during the train (max-begin): 1034
CPU Total Peak Memory consumed during the train (max): 8395
So, it works fine with a decent batch size. However, if input and output sequence lengths are very large, it might cause OOM, as activations from the intermediate layers become the bottleneck.
Max length is 512 for both source and target.
Thanks a lot, @pacman100! This is awesome! I will reduce the max length for the input sequence. I am trying to see if I can pass a question and have Flan-T5 generate an answer/context summary.
Does it help if I increase gradient accumulation steps from 1 to 4? Will it help model accuracy, since I may be able to fit a larger effective batch size?
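For reference, gradient accumulation doesn't change the memory needed per micro-batch; it only raises the effective batch size. A minimal sketch of the plain-Accelerate pattern (not the exact training script from this thread; model, optimizer, and train_dataloader are assumed to be defined already, and with the DeepSpeed plugin the gradient_accumulation_steps value typically comes from the config above instead):

# Sketch only: model, optimizer, and train_dataloader are assumed to exist.
# Effective batch size = per-device batch size x num_processes x gradient_accumulation_steps,
# e.g. 1 x 8 x 4 = 32.
from accelerate import Accelerator

accelerator = Accelerator(gradient_accumulation_steps=4)
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)

model.train()
for batch in train_dataloader:
    with accelerator.accumulate(model):  # only syncs gradients and steps every 4 micro-batches
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()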