
How to save / form the config.json after fine-tuning - Flan T5 11b

Open sujithjoseph opened this issue 2 years ago • 46 comments

After fine-tuning a Flan T5 11B model on custom data, I was saving the checkpoint via accelerate like this:

        accelerator.wait_for_everyone()
        accelerator.save(
            get_peft_model_state_dict(model, state_dict=accelerator.get_state_dict(model)), checkpoint_name
        )
        accelerator.wait_for_everyone() 

It didn't create the config.json needed to load the model. The checkpoint itself was created (cdcFT5_lora.pt, a ~19 MB file).

I am trying to create it manually for inference, using the parameters I used for training and looking at some sample LoRA model files. Should target_modules be

"target_modules": [ "q", "v" ],

OR

"target_modules": [ "query_key_value" ],

{
  "base_model_name_or_path": "./cdcFT5_lora.pt",
  "bias": "none",
  "enable_lora": [
    true,
    false,
    true
  ],
  "fan_in_fan_out": true,
  "inference_mode": true,
  "lora_alpha": 32,
  "lora_dropout": 0.1,
  "merge_weights": false,
  "modules_to_save": null,
  "peft_type": "LORA",
  "r": 8,
  "target_modules": [
    "q",
    "v"
  ],
  "task_type": "SEQ_2_SEQ_LM"
}

What values should I give for "enable_lora": [true, false, true] and "fan_in_fan_out": true?

For inference, should it be enable_lora as true and fan_in_fan_out as false?

How do I save the model with config.json directly as well?

Is it via

peft_model_id = f"{model_name_or_path}_{peft_config.peft_type}_{peft_config.task_type}"
accelerator.save_pretrained(peft_model_id)

I see that model.save_pretrained() exists; I am not sure whether accelerator.save_pretrained(peft_model_id) works as well.

Is there any way to load the checkpoint and create the config file as well, without re-training?
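
(For reference, a minimal sketch of reconstructing the adapter config and loading the saved .pt state dict without re-training. The hyperparameters mirror the config above; the base model name, file paths, and the top-level import of set_peft_model_state_dict are assumptions.)

import torch
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, TaskType, get_peft_model, set_peft_model_state_dict

# Rebuild the LoRA config from the training hyperparameters
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q", "v"],
)

# Wrap the base model and push the saved LoRA-only weights into it
base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xxl")
model = get_peft_model(base, lora_config)
state_dict = torch.load("cdcFT5_lora.pt", map_location="cpu")
set_peft_model_state_dict(model, state_dict)

# Writes the adapter weights plus adapter_config.json into the directory
model.save_pretrained("cdcFT5_lora_adapter")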

sujithjoseph avatar Feb 15 '23 20:02 sujithjoseph

I was able to re-create the config file by training on a smaller dataset and then saving it using

finalmodel = accelerator.unwrap_model(model)
finalmodel.save_pretrained(peft_model_id)
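
For completeness, a sketch of the save path that writes the adapter weights and the config together in one call, following the Accelerate + DeepSpeed pattern used above (assuming save_pretrained accepts a state_dict kwarg, as in the PEFT examples):

accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    peft_model_id,
    state_dict=accelerator.get_state_dict(model),  # gathers the full state dict under ZeRO-3
)
accelerator.wait_for_everyone()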

sujithjoseph avatar Feb 16 '23 01:02 sujithjoseph

How can I do inference easily from a PeftModelForSeq2SeqLM model using Hugging Face pipelines, like this?

from transformers import pipeline

summarizer = pipeline("summarization", "cdcFT5lra", torch_dtype=torch.bfloat16)

raw_document = 'You must be 18 years old to live or work in New York State...'
prompt = "Summarize the following article in 10-20 words:"
results = summarizer(
        f"{prompt} \n\n {raw_document}",
        num_beams=5,
        min_length=5,
        no_repeat_ngram_size=3,
        truncation=True,
        max_length=512,
    )

OR

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")

def generate_simple(input_text):
  input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
  output = model.generate(
    input_ids, 
    max_length=1024,
    temperature=0.7)
  return tokenizer.decode(output[0], skip_special_tokens=True)

input_text = """Explain artificial intelligence"""
generate_simple(input_text)

This doesn't work; it gives an error:

TypeError: generate() takes 1 positional argument but 2 were given

The PEFT examples use datasets as input for inference. Is that the only way?

sujithjoseph avatar Feb 16 '23 01:02 sujithjoseph

Hello @sujithjoseph, for PEFT generate methods, one has to pass the inputs as keyword arguments. Could you try the change below and let us know if that resolves the issue? Will add this point to the caveats.

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl")

def generate_simple(input_text):
  input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
  output = model.generate(
-   input_ids, 
+   input_ids=input_ids,
    max_length=1024,
    temperature=0.7)
  return tokenizer.decode(output[0], skip_special_tokens=True)

input_text = """Explain artificial intelligence"""
generate_simple(input_text)

pacman100 avatar Feb 16 '23 02:02 pacman100

Also, you can use it with Pipelines via the logic below. A warning will be displayed saying the model might be unsupported; it can be ignored, because PeftModel isn't a subclass of model classes such as T5:

from transformers import SummarizationPipeline


summarizer = SummarizationPipeline(model= model, tokenizer= tokenizer)

raw_document = 'You must be 18 years old to live or work in New York State...'
prompt = "Summarize the following article in 10-20 words:"
results = summarizer(
        f"{prompt} \n\n {raw_document}",
        num_beams=5,
        min_length=5,
        no_repeat_ngram_size=3,
        truncation=True,
        max_length=512,
    )

Let us know if the above snippet helps with using the pipeline.

pacman100 avatar Feb 16 '23 02:02 pacman100

Thanks @pacman100, really appreciate it! Had a follow-up question: I was trying to load the model in int8.


max_memory={0: "30GIB", 1: "0GIB", 2: "0GIB", 3: "0GIB", 4: "0GIB", "cpu":"60GB"}
config = PeftConfig.from_pretrained(peft_model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, device_map="auto", max_memory=max_memory, load_in_8bit=True)
model = PeftModel.from_pretrained(model, peft_model_id, device_map="auto", max_memory=max_memory)

Got a runtime error: RuntimeError: expected scalar type Half but found Float

By default, does it load in bfloat16 or float16 if the model was trained in bfloat16?
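
(On the dtype question: from_pretrained loads weights in fp32 by default regardless of the training dtype, unless torch_dtype or load_in_8bit is passed. A sketch of pinning the dtype explicitly, with placeholder names; treat it as an illustration, not a confirmed fix for the Half/Float error:)

import torch
from transformers import AutoModelForSeq2SeqLM
from peft import PeftConfig, PeftModel

config = PeftConfig.from_pretrained(peft_model_id)

# Without torch_dtype the base model comes up in fp32, even if it was trained in bf16
base = AutoModelForSeq2SeqLM.from_pretrained(
    config.base_model_name_or_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, peft_model_id)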

sujithjoseph avatar Feb 16 '23 07:02 sujithjoseph

The fine-tuned flan-t5-xxl takes around 10-20 seconds on a single 40 GB A100 GPU to answer a prompt. Is there anything that can be done to make it faster without using a smaller flan-t5 model?

sujithjoseph avatar Feb 16 '23 07:02 sujithjoseph

Try running in bf16 instead of fp32. Also, you can look at ONNX/TensorRT.

mayank31398 avatar Feb 16 '23 13:02 mayank31398

Had a follow-up question: I was trying to load the model in int8.

To load a model trained using Accelerate + DeepSpeed ZeRO-3, you can do the following. Below is an example for a 3B model:

+ from peft import prepare_model_for_training
  peft_model_id = "smangrul/twitter_complaints_bigscience_T0_3B_LORA_SEQ_2_SEQ_LM"
  config = PeftConfig.from_pretrained(peft_model_id)
  model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, 
             load_in_8bit=True, 
              device_map={'':0})
+ model = prepare_model_for_training(model)
  model = PeftModel.from_pretrained(model, peft_model_id)
  tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

Then running generate as usual:

%%time
model.eval()
inputs = tokenizer(f'{text_column} : Iphone battery sucks 11.2.1 @Apple Label : ', return_tensors="pt")
print(dataset["test"][i]["Tweet text"])
print(inputs)

with torch.no_grad():
    outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=10)
    print(outputs)
    print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))
# ['complaint']

I ran the below snippet in a Jupyter cell for the following 3 settings:

from time import time
model.eval()
inputs = tokenizer(f'{text_column} : Iphone battery sucks 11.2.1 @Apple Label : ', return_tensors="pt")
print(inputs)
times = [] #in ms

for i in range(100):
    with torch.no_grad():
        #with torch.cuda.amp.autocast():
        start = time()
        outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=10)
        times.append((time()-start)*1000)
print(outputs)
print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))

sum(times)/len(times)

  1. For fp32, load directly without using device_map if you have enough GPU memory: model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
  2. For bf16, after loading the PeftModel, do model.to(torch.bfloat16) (see the loading sketch after the table)

precision   inference wall time (ms)
FP32        96
BF16        105
INT8        370
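
For reference, the three settings above might be set up along these lines (a sketch; peft_model_id is a placeholder and the int8 path follows the earlier snippet):

import torch
from transformers import AutoModelForSeq2SeqLM
from peft import PeftConfig, PeftModel

config = PeftConfig.from_pretrained(peft_model_id)

# FP32: load directly, no device_map, if the GPU has enough memory
base = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path)
model = PeftModel.from_pretrained(base, peft_model_id).to("cuda")

# BF16: same as FP32, then cast after the PeftModel is built
# model = model.to(torch.bfloat16)

# INT8: quantized base weights, adapter loaded on top
# base = AutoModelForSeq2SeqLM.from_pretrained(
#     config.base_model_name_or_path, load_in_8bit=True, device_map={"": 0}
# )
# model = PeftModel.from_pretrained(base, peft_model_id)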

@mayank31398, BF16 is taking more time than FP32, which is peculiar; usually with fp16 models latency is cut roughly in half, but here it is increasing. To make sure this isn't related to PEFT, I loaded just the pretrained LLM and can still see the same behaviour, with BF16 latency higher than FP32.

pacman100 avatar Feb 16 '23 13:02 pacman100

@sujithjoseph, device_map and load_in_8bit are meant for low-resource inference, e.g. when your GPU's VRAM can't fit the entire model: device_map offloads parts of it to CPU or across smaller GPUs, while load_in_8bit aims to fit such large models on a given GPU by keeping weights in int8 precision.

For very low latencies, as @mayank31398 suggested, you would have to convert the model to ONNX/TensorRT; alternatively, use flash attention, fused kernels, etc.
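
As a concrete illustration of the offloading described above (this mirrors the snippet earlier in the thread; the memory limits are placeholders):

from transformers import AutoModelForSeq2SeqLM
from peft import PeftConfig, PeftModel

# Cap GPU 0 usage and spill the rest to CPU RAM; values are illustrative only
max_memory = {0: "30GiB", "cpu": "60GiB"}

config = PeftConfig.from_pretrained(peft_model_id)
base = AutoModelForSeq2SeqLM.from_pretrained(
    config.base_model_name_or_path,
    device_map="auto",
    max_memory=max_memory,
    load_in_8bit=True,
)
model = PeftModel.from_pretrained(base, peft_model_id, device_map="auto", max_memory=max_memory)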

pacman100 avatar Feb 16 '23 13:02 pacman100

Thanks a lot @pacman100 @mayank31398! This has been really insightful. I didn't know that converting the model to TensorRT and serving it via the TRT inference server would be faster than PEFT + DeepSpeed ZeRO-3 for inference.

sujithjoseph avatar Feb 16 '23 18:02 sujithjoseph

I also see quality issues with the fine-tuned flan-t5-xxl (trained on 500K records), unlike the original model; it's hallucinating a lot. I had used a batch size of 1, as I couldn't fit training on 8x 40 GB A100s with a batch size of 2 (it would run for a couple of hours and then go OOM). Here are the train/eval ppl/loss:

epoch: 0 train_ppl: 133.7952117919922 train_epoch_loss: 4.896310329437256
eval_ppl: 1.5221441984176636 eval_epoch_loss: 0.4201200008392334

def generate_custom(input_text):
  input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
  output = model.generate(
    input_ids=input_ids,
    min_length=256,
    max_new_tokens=1024,
    length_penalty=1.4,
    no_repeat_ngram_size=2,
    top_k=150,
    top_p=0.92,
    repetition_penalty=2.1,
    #num_beams=4,
    temperature=0.7)
  return tokenizer.decode(output[0], skip_special_tokens=True)

sujithjoseph avatar Feb 16 '23 18:02 sujithjoseph

8x 40G A100s should be enough for PEFT training of FLAN. Can you tell me what backend you are using? Are you not using DeepSpeed?

mayank31398 avatar Feb 16 '23 19:02 mayank31398

Yes, DeepSpeed ZeRO-3. It worked fine with a batch size of 1, not 2. I am concerned that the lower batch size is impacting model quality. I had 500K records in the training set. Here is my config (DeepSpeed / Accelerate):

compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
  bf16:enabled: true
distributed_type: DEEPSPEED
downcast_bf16: true
dynamo_backend: 'NO'
fsdp_config: {}
machine_rank: 0
main_training_function: main
megatron_lm_config: {}
mixed_precision: 'bf16'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
use_cpu: false

reference - https://github.com/microsoft/DeepSpeed/issues/2820

sujithjoseph avatar Feb 16 '23 19:02 sujithjoseph

I only see 4 processes in the yaml ^^ You can always enable CPU offloading.

mayank31398 avatar Feb 16 '23 19:02 mayank31398

@mayank31398 I had started with 4 and expanded to 8; my final config has num_processes as 8. Doesn't this enable CPU offloading?

  offload_optimizer_device: cpu
  offload_param_device: cpu

sujithjoseph avatar Feb 16 '23 19:02 sujithjoseph

I had also changed this in the final config: dynamo_backend: 'INDUCTOR'

sujithjoseph avatar Feb 16 '23 19:02 sujithjoseph

If I shard the xxl base model like this

model.save_pretrained("sharded", max_shard_size="2000MB")

will it help me then fine-tune it with a larger batch size, or should I load it in int8 and fine-tune it with the larger batch size that fits in memory? I am not sure which one will result in a higher quality model.

sujithjoseph avatar Feb 16 '23 19:02 sujithjoseph

Since I have the CUDA 11.6 driver installed (Vertex AI), I was using torch 1.12.1+cu116. During installation, I see this:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
peft 0.2.0.dev0 requires torch>=1.13.0, but you have torch 1.12.1+cu116 which is incompatible.

Does peft really need torch 1.13.0? So far, I haven't seen any issues using 1.12.1+cu116 with peft.

sujithjoseph avatar Feb 16 '23 19:02 sujithjoseph

@pacman100, I am not able to import prepare_model_for_training from main. I did pip install -U git+https://github.com/huggingface/peft.git. Should I install this branch instead: https://github.com/huggingface/peft/tree/younesbelkada-flan-t5-xl ?

ImportError: cannot import name 'prepare_model_for_training' from 'peft' (/opt/conda/lib/python3.7/site-packages/peft/__init__.py). I see it in https://github.com/huggingface/peft/blob/main/src/peft/utils/other.py, and in https://github.com/huggingface/peft/blob/main/src/peft/__init__.py as well. I probably need to uninstall and install again.

sujithjoseph avatar Feb 16 '23 20:02 sujithjoseph

pip install --upgrade -e git+https://github.com/huggingface/peft.git#egg=peft
pip install --upgrade git+https://github.com/huggingface/peft.git

This helped to fix it.

sujithjoseph avatar Feb 16 '23 20:02 sujithjoseph

from time import time
model.eval()
inputs = tokenizer(f'Explain Artificial Intelligence ', return_tensors="pt")
print(inputs)
times = [] #in ms

for i in range(100):
    with torch.no_grad():
        #with torch.cuda.amp.autocast():
        start = time()
        outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_tokens=10)
        times.append((time()-start)*1000)
print(outputs)
print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))

sum(times)/len(times)

This gives the below error: AttributeError: 'NoneType' object has no attribute 'device'

─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>                                                                                      │
│                                                                                                  │
│    8 │   with torch.no_grad():                                                                   │
│    9 │   │   #with torch.cuda.amp.autocast():                                                    │
│   10 │   │   start = time()                                                                      │
│ ❱ 11 │   │   outputs = model.generate(input_ids=inputs["input_ids"].to("cuda"), max_new_token    │
│   12 │   │   times.append((time()-start)*1000)                                                   │
│   13 print(outputs)                                                                              │
│   14 print(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True))     │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/peft/peft_model.py:708 in generate                        │
│                                                                                                  │
│   705 │                                                                                          │
│   706 │   def generate(self, **kwargs):                                                          │
│   707 │   │   if not isinstance(self.peft_config, PromptLearningConfig):                         │
│ ❱ 708 │   │   │   return self.base_model.generate(**kwargs)                                      │
│   709 │   │   else:                                                                              │
│   710 │   │   │   if "input_ids" not in kwargs:                                                  │
│   711 │   │   │   │   raise ValueError("input_ids must be provided for Peft model generation")   │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py:27 in decorate_context        │
│                                                                                                  │
│    24 │   │   @functools.wraps(func)                                                             │
│    25 │   │   def decorate_context(*args, **kwargs):                                             │
│    26 │   │   │   with self.clone():                                                             │
│ ❱  27 │   │   │   │   return func(*args, **kwargs)                                               │
│    28 │   │   return cast(F, decorate_context)                                                   │
│    29 │                                                                                          │
│    30 │   def _wrap_generator(self, func):                                                       │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/generation/utils.py:1248 in generate         │
│                                                                                                  │
│   1245 │   │   │   # if model is encoder decoder encoder_outputs are created                     │
│   1246 │   │   │   # and added to `model_kwargs`                                                 │
│   1247 │   │   │   model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(           │
│ ❱ 1248 │   │   │   │   inputs_tensor, model_kwargs, model_input_name                             │
│   1249 │   │   │   )                                                                             │
│   1250 │   │                                                                                     │
│   1251 │   │   # 5. Prepare `input_ids` which will be used for auto-regressive generation        │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/generation/utils.py:609 in                   │
│ _prepare_encoder_decoder_kwargs_for_generation                                                   │
│                                                                                                  │
│    606 │   │   model_input_name = model_input_name if model_input_name is not None else self.ma  │
│    607 │   │   encoder_kwargs["return_dict"] = True                                              │
│    608 │   │   encoder_kwargs[model_input_name] = inputs_tensor                                  │
│ ❱  609 │   │   model_kwargs["encoder_outputs"]: ModelOutput = encoder(**encoder_kwargs)          │
│    610 │   │                                                                                     │
│    611 │   │   return model_kwargs                                                               │
│    612                                                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1130 in _call_impl             │
│                                                                                                  │
│   1127 │   │   # this function, and just call forward.                                           │
│   1128 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1129 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1130 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1131 │   │   # Do not call functions when jit is used                                          │
│   1132 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1133 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:1075 in forward     │
│                                                                                                  │
│   1072 │   │   │   │   │   cross_attn_layer_head_mask=cross_attn_layer_head_mask,                │
│   1073 │   │   │   │   │   past_key_value=past_key_value,                                        │
│   1074 │   │   │   │   │   use_cache=use_cache,                                                  │
│ ❱ 1075 │   │   │   │   │   output_attentions=output_attentions,                                  │
│   1076 │   │   │   │   )                                                                         │
│   1077 │   │   │                                                                                 │
│   1078 │   │   │   # layer_outputs is a tuple with:                                              │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1130 in _call_impl             │
│                                                                                                  │
│   1127 │   │   # this function, and just call forward.                                           │
│   1128 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1129 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1130 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1131 │   │   # Do not call functions when jit is used                                          │
│   1132 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1133 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/accelerate/hooks.py:158 in new_forward                    │
│                                                                                                  │
│   155 │   │   │   with torch.no_grad():                                                          │
│   156 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   157 │   │   else:                                                                              │
│ ❱ 158 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   159 │   │   return module._hf_hook.post_forward(module, output)                                │
│   160 │                                                                                          │
│   161 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:692 in forward      │
│                                                                                                  │
│    689 │   │   │   layer_head_mask=layer_head_mask,                                              │
│    690 │   │   │   past_key_value=self_attn_past_key_value,                                      │
│    691 │   │   │   use_cache=use_cache,                                                          │
│ ❱  692 │   │   │   output_attentions=output_attentions,                                          │
│    693 │   │   )                                                                                 │
│    694 │   │   hidden_states, present_key_value_state = self_attention_outputs[:2]               │
│    695 │   │   attention_outputs = self_attention_outputs[2:]  # Keep self-attention outputs an  │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1130 in _call_impl             │
│                                                                                                  │
│   1127 │   │   # this function, and just call forward.                                           │
│   1128 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1129 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1130 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1131 │   │   # Do not call functions when jit is used                                          │
│   1132 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1133 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/accelerate/hooks.py:158 in new_forward                    │
│                                                                                                  │
│   155 │   │   │   with torch.no_grad():                                                          │
│   156 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   157 │   │   else:                                                                              │
│ ❱ 158 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   159 │   │   return module._hf_hook.post_forward(module, output)                                │
│   160 │                                                                                          │
│   161 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:599 in forward      │
│                                                                                                  │
│    596 │   │   │   layer_head_mask=layer_head_mask,                                              │
│    597 │   │   │   past_key_value=past_key_value,                                                │
│    598 │   │   │   use_cache=use_cache,                                                          │
│ ❱  599 │   │   │   output_attentions=output_attentions,                                          │
│    600 │   │   )                                                                                 │
│    601 │   │   hidden_states = hidden_states + self.dropout(attention_output[0])                 │
│    602 │   │   outputs = (hidden_states,) + attention_output[1:]  # add attentions if we output  │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1130 in _call_impl             │
│                                                                                                  │
│   1127 │   │   # this function, and just call forward.                                           │
│   1128 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1129 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1130 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1131 │   │   # Do not call functions when jit is used                                          │
│   1132 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1133 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/accelerate/hooks.py:158 in new_forward                    │
│                                                                                                  │
│   155 │   │   │   with torch.no_grad():                                                          │
│   156 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   157 │   │   else:                                                                              │
│ ❱ 158 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   159 │   │   return module._hf_hook.post_forward(module, output)                                │
│   160 │                                                                                          │
│   161 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/transformers/models/t5/modeling_t5.py:511 in forward      │
│                                                                                                  │
│    508 │   │   │   return hidden_states                                                          │
│    509 │   │                                                                                     │
│    510 │   │   # get query states                                                                │
│ ❱  511 │   │   query_states = shape(self.q(hidden_states))  # (batch_size, n_heads, seq_length,  │
│    512 │   │                                                                                     │
│    513 │   │   # get key/value states                                                            │
│    514 │   │   key_states = project(                                                             │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py:1130 in _call_impl             │
│                                                                                                  │
│   1127 │   │   # this function, and just call forward.                                           │
│   1128 │   │   if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks o  │
│   1129 │   │   │   │   or _global_forward_hooks or _global_forward_pre_hooks):                   │
│ ❱ 1130 │   │   │   return forward_call(*input, **kwargs)                                         │
│   1131 │   │   # Do not call functions when jit is used                                          │
│   1132 │   │   full_backward_hooks, non_full_backward_hooks = [], []                             │
│   1133 │   │   if self._backward_hooks or _global_backward_hooks:                                │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/accelerate/hooks.py:158 in new_forward                    │
│                                                                                                  │
│   155 │   │   │   with torch.no_grad():                                                          │
│   156 │   │   │   │   output = old_forward(*args, **kwargs)                                      │
│   157 │   │   else:                                                                              │
│ ❱ 158 │   │   │   output = old_forward(*args, **kwargs)                                          │
│   159 │   │   return module._hf_hook.post_forward(module, output)                                │
│   160 │                                                                                          │
│   161 │   module.forward = new_forward                                                           │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/peft/tuners/lora.py:456 in forward                        │
│                                                                                                  │
│   453 │   │   │   │   nn.init.zeros_(self.lora_B.weight)                                         │
│   454 │   │                                                                                      │
│   455 │   │   def forward(self, x: torch.Tensor):                                                │
│ ❱ 456 │   │   │   result = super().forward(x)                                                    │
│   457 │   │   │   if self.r > 0:                                                                 │
│   458 │   │   │   │   result += self.lora_B(self.lora_A(self.lora_dropout(x))) * self.scaling    │
│   459 │   │   │   return result                                                                  │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/bitsandbytes/nn/modules.py:242 in forward                 │
│                                                                                                  │
│   239 │   │   if self.bias is not None and self.bias.dtype != x.dtype:                           │
│   240 │   │   │   self.bias.data = self.bias.data.to(x.dtype)                                    │
│   241 │   │                                                                                      │
│ ❱ 242 │   │   out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)                 │
│   243 │   │   if not self.state.has_fp16_weights:                                                │
│   244 │   │   │   if self.state.CB is not None and self.state.CxB is not None:                   │
│   245 │   │   │   │   # we converted 8-bit row major to turing/ampere format in the first infe   │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/bitsandbytes/autograd/_functions.py:488 in matmul         │
│                                                                                                  │
│   485 │   state = state or MatmulLtState()                                                       │
│   486 │   if threshold > 0.0:                                                                    │
│   487 │   │   state.threshold = threshold                                                        │
│ ❱ 488 │   return MatMul8bitLt.apply(A, B, out, bias, state)                                      │
│   489                                                                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/bitsandbytes/autograd/_functions.py:320 in forward        │
│                                                                                                  │
│   317 │   │   │   │   │   state.CxB, state.SB = F.transform(state.CB, to_order=formatB)          │
│   318 │   │   else:                                                                              │
│   319 │   │   │   if not state.has_fp16_weights and state.CxB is None and using_igemmlt:         │
│ ❱ 320 │   │   │   │   state.CxB, state.SB = F.transform(state.CB, to_order=formatB)              │
│   321 │   │   │   subA = None                                                                    │
│   322 │   │                                                                                      │
│   323 │   │   # 2. Quantize B                                                                    │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/bitsandbytes/functional.py:1698 in transform              │
│                                                                                                  │
│   1695                                                                                           │
│   1696                                                                                           │
│   1697 def transform(A, to_order, from_order='row', out=None, transpose=False, state=None, ld=N  │
│ ❱ 1698 │   prev_device = pre_call(A.device)                                                      │
│   1699 │   if state is None: state = (A.shape, from_order)                                       │
│   1700 │   else: from_order = state[1]                                                           │
│   1701 │   if out is None: out, new_state = get_transform_buffer(state[0], A.dtype, A.device, t  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
AttributeError: 'NoneType' object has no attribute 'device'

sujithjoseph avatar Feb 16 '23 20:02 sujithjoseph

This only happens when I load the model in 8-bit.

config = PeftConfig.from_pretrained(peft_model_id)

model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, device_map={'':0}, load_in_8bit=True,torch_dtype=torch.float16)
device = torch.device("cuda")
model.cuda()
model = prepare_model_for_training(model)
model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

This happens either on 1 GPU or with device_map='auto'.

sujithjoseph avatar Feb 16 '23 21:02 sujithjoseph

@sujithjoseph, what is the DeepSpeed version being used? PEFT requires v0.8.0, as it resolves a bug related to training when a lot of params are frozen.

pacman100 avatar Feb 17 '23 01:02 pacman100

This only happens when I load the model in 8-bit.

config = PeftConfig.from_pretrained(peft_model_id)

model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path, device_map={'':0}, load_in_8bit=True,torch_dtype=torch.float16)
device = torch.device("cuda")
model.cuda()
model = prepare_model_for_training(model)
model = PeftModel.from_pretrained(model, peft_model_id)
tokenizer = AutoTokenizer.from_pretrained(config.base_model_name_or_path)

This happens either on 1 GPU or with device_map='auto'.

Does adding device_map={'':0} to PeftModel.from_pretrained resolve the issue: model = PeftModel.from_pretrained(model, peft_model_id, device_map={'':0})?

pacman100 avatar Feb 17 '23 04:02 pacman100

Also, may I know the input and output seq lengths of the dataset?

In my experiment on a summarization task using PEFT+DS_Z3 with 4 A100 GPUs,

  1. Input seq length = 255
  2. output seq length = 50
  3. batch_size_per_gpu = 8 (so total batch size of 32=8*4)

Code: https://gist.github.com/pacman100/c07b7c5b279543d0c1d164bf9c03967b

I observe below memory stats:

GPU Memory before entering the train : 954
GPU Memory consumed at the end of the train (end-begin): 1020
GPU Peak Memory consumed during the train (max-begin): 31323
GPU Total Peak Memory consumed during the train (max): 32277
CPU Memory before entering the train : 7361
CPU Memory consumed at the end of the train (end-begin): 1034
CPU Peak Memory consumed during the train (max-begin): 1034
CPU Total Peak Memory consumed during the train (max): 8395

So, it works fine with a decent batch size. However, if the input and output sequence lengths are very large, it might cause OOM, as activations from intermediate layers would become the bottleneck.

pacman100 avatar Feb 17 '23 04:02 pacman100

@sujithjoseph, what is the DeepSpeed version being used? PEFT requires v0.8.0, as it resolves a bug related to training when a lot of params are frozen.

@pacman100 deepspeed==0.8.0

sujithjoseph avatar Feb 17 '23 05:02 sujithjoseph

Also, may I know the input and output seq lengths of the dataset?

In my experiment on a summarization task using PEFT+DS_Z3 with 4 A100 GPUs,

  1. Input seq length = 255
  2. output seq length = 50
  3. batch_size_per_gpu = 8 (so total batch size of 32=8*4)

Code: https://gist.github.com/pacman100/c07b7c5b279543d0c1d164bf9c03967b

I observe below memory stats:

GPU Memory before entering the train : 954
GPU Memory consumed at the end of the train (end-begin): 1020
GPU Peak Memory consumed during the train (max-begin): 31323
GPU Total Peak Memory consumed during the train (max): 32277
CPU Memory before entering the train : 7361
CPU Memory consumed at the end of the train (end-begin): 1034
CPU Peak Memory consumed during the train (max-begin): 1034
CPU Total Peak Memory consumed during the train (max): 8395

So, it works fine with a decent batch size. However, if the input and output sequence lengths are very large, it might cause OOM, as activations from intermediate layers would become the bottleneck.

max length is 512 for both source and target.

sujithjoseph avatar Feb 17 '23 05:02 sujithjoseph

Thanks a lot, @pacman100! This is awesome! I will reduce the max input seq length. I am trying to see if I can pass a question and have Flan T5 generate an answer/context summary.

sujithjoseph avatar Feb 17 '23 05:02 sujithjoseph

Does it help if I increase gradient accumulation steps from 1 to 4? Will it help model accuracy, since I may be able to fit a larger effective batch size?
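
For context, gradient accumulation raises the effective batch size without increasing per-step memory; a quick sketch of the arithmetic (numbers are illustrative):

# effective batch = per-device batch * number of GPUs * gradient accumulation steps
per_device_batch = 1
num_gpus = 8
grad_accum_steps = 4
effective_batch = per_device_batch * num_gpus * grad_accum_steps  # 32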

sujithjoseph avatar Feb 17 '23 05:02 sujithjoseph