[BUG]: loading OPT 66B model - CPU runs out of memory
Is there an existing issue for this bug?
- [X] I have searched the existing issues
🐛 Describe the bug
I am trying to reproduce OPT-66B using 16xH100 (2 servers). Each server has 1000 GiB of CPU memory. When I run the OPT benchmark, the program crashes with the following error; watching CPU memory, it climbs to 924 GiB before the crash. How can I run the OPT-66B benchmark with these resources?
Error log:
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/usr/local/lib/python3.8/dist-packages/colossalai/initialize.py:48: UserWarning: `config` is deprecated and will be removed soon.
warnings.warn("`config` is deprecated and will be removed soon.")
[06/25/24 19:04:54] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/dist-packages/colossalai/initialize.py:67 launch
[06/25/24 19:04:55] INFO colossalai - colossalai - INFO: Distributed environment is initialized, world size: 16
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51974 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51975 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51976 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51977 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51978 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51980 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51981 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 5 (pid: 51979) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.14.0a0+44dac51', 'console_scripts', 'torchrun')())
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
opt/opt_train_demo.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
Environment
Docker image: nvcr.io/nvidia/pytorch:23.02-py3
transformers: 4.33
colossalai: 0.3.6
You can try lazy init as shown here and file a PR if it works: https://github.com/hpcaitech/ColossalAI/blob/8e718a1421203e0f5607f477e1a998567c70d123/examples/language/llama/benchmark.py#L245
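For reference, a minimal sketch of what the lazy-init path could look like for the OPT demo. Assumptions: the plugin/booster setup comes from opt_train_demo.py, pretrained weights are loaded through booster.load_model after boost rather than eagerly via from_pretrained, and helper names such as get_current_device may differ between ColossalAI versions (the linked benchmark uses get_accelerator().get_current_device()).

from transformers import AutoConfig, OPTForCausalLM
from colossalai.lazy import LazyInitContext
from colossalai.utils import get_current_device

# Build the 66B model lazily: parameters are not materialized on the host here,
# so each rank does not hold a full copy of the model in CPU memory.
config = AutoConfig.from_pretrained("facebook/opt-66b")
with LazyInitContext(default_device=get_current_device()):
    model = OPTForCausalLM(config)

# ... create the plugin, optimizer and booster as in opt_train_demo.py ...
# model, optimizer, _, dataloader, lr_scheduler = booster.boost(model, optimizer, ...)
# After boost, load the pretrained weights through the booster so each rank only
# materializes its own shards:
# booster.load_model(model, "path/to/opt-66b/checkpoint")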
Thanks @Edenzzzz for the suggestion, I will try it. I also have one more question: during evaluation of OPT, the eval loss for a model trained with the hybrid_parallel plugin is 5x higher than with the gemini plugin, and it is like this for most of the OPT variants. Do you know why?
@Edenzzzz Further evaluating llama2, I see a similar pattern: the eval loss for a model fine-tuned with the hybrid_parallel plugin is 5x higher than with the other plugins.
Am I missing anything while evaluating a model fine-tuned with the Hybrid Parallel plugin?
Eval script
import argparse

import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, GPTJForCausalLM, LlamaForCausalLM, OPTForCausalLM
import numpy as np
import evaluate

from dataset_util import create_dataset

parser = argparse.ArgumentParser()
parser.add_argument('--model_name', type=str, default="meta-llama/Llama-2-7b-hf")
parser.add_argument('--output_dir', type=str, required=True)
parser.add_argument('--batch_size', type=int, required=False, default=8)
parser.add_argument('--max_length', type=int, required=False, default=512)
parser.add_argument('--saved_model_path', type=str, default="")
args = parser.parse_args()

args.dataset_path = "yizhongw/self_instruct"
args.dataset_size = 49600
max_length = args.max_length
learning_rate = 0.00002


def prepare_dataset():
    tokenizer = AutoTokenizer.from_pretrained(
        args.model_name,
        padding_side="left",
        add_eos_token=True,
        add_bos_token=True,
    )
    tokenizer.pad_token = tokenizer.eos_token
    tokenized_train_dataset, tokenized_val_dataset, data_collator = create_dataset(
        dataset_name=args.dataset_path,
        tokenizer=tokenizer,
        max_length=args.max_length)
    return tokenizer, tokenized_train_dataset, tokenized_val_dataset, data_collator


tokenizer, tokenized_train_dataset, tokenized_val_dataset, data_collator = prepare_dataset()

print("Loading model")
kwargs = {}
if "llama" in args.model_name:
    model = LlamaForCausalLM.from_pretrained(args.saved_model_path, use_cache=False, low_cpu_mem_usage=False,
                                             torch_dtype=torch.bfloat16, **kwargs)
elif "opt" in args.model_name:
    model = OPTForCausalLM.from_pretrained(args.saved_model_path, use_cache=False, low_cpu_mem_usage=False,
                                           torch_dtype=torch.float16, **kwargs)
model.resize_token_embeddings(len(tokenizer))
print("Model loaded")

print("Preparing training arguments")
training_args = transformers.TrainingArguments(
    args.output_dir,
    logging_steps=1,
    label_names=["input_ids", "attention_mask"],
    push_to_hub=False,
    report_to="none",
    disable_tqdm=True,
    per_device_train_batch_size=args.batch_size,
    per_device_eval_batch_size=8,
    do_train=False,
    do_eval=True,
    evaluation_strategy="steps",
    eval_accumulation_steps=args.batch_size,
)

metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions[0], references=labels[0][0])


data_collator = transformers.DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)

trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.evaluate()
I didn't see a difference in training loss using examples/language/llama/benchmark.py. Which dataset and script did you use for training?
@Edenzzzz
I am using the training script.
I am using the yizhongw/self_instruct dataset.
Eval logs for the model trained with the Hybrid Parallel plugin, pp_size=4 and tp_size=4:
Loading checkpoint shards: 100%|██████████| 8/8 [00:04<00:00, 1.89it/s]
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /workspace/ColossalAI_old/examples/language/100_llama7b_models_627/hybrid_parallel_4_4_64_llama2-7b-hf/epoch0-step48/model/ and are newly initialized: ['model.layers.19.mlp.down_proj.weight', 'model.layers.16.post_attention_layernorm.weight', 'model.layers.31.self_attn.k_proj.weight', 'model.layers.28.mlp.up_proj.weight', ... every self_attn (q/k/v/o_proj), mlp (gate/up/down_proj), input_layernorm and post_attention_layernorm weight for layers 16-31 ..., 'model.norm.weight', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 32000. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
Model loaded
Preparing training arguments
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
{'eval_loss': 11.167560577392578, 'eval_accuracy': 0.0, 'eval_runtime': 63.0477, 'eval_samples_per_second': 7.931, 'eval_steps_per_second': 0.127}
According to the log, it seems that the checkpoint is not loaded correctly.
@Edenzzzz Could you please take a look at the reply above? Thank you for your support.
Some weights of LlamaForCausalLM were not initialized from the model checkpoint
You might have forgotten to call model.unwrap before saving it, which causes key mismatches. Booster.boost adds a wrapper to the model.
@Edenzzzz I am using the script to finetune and save the model. Using the Gemini plugin doesn't give any warning about weight key mismatches during evaluation with the same script; it only happens when I use a model trained with plugin=Hybrid Parallel.
The save_sharded_model method in colossalai/checkpoint_io/hybrid_parallel_checkpoint_io.py already applies unwrap, so I am not sure why it's giving key mismatches.
It looks like training with the hybrid_parallel plugin doesn't restore the model correctly, does it?
You should use booster.save_model, which unwraps the model.
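For reference, a minimal save/load sketch with the Booster checkpoint API (paths are placeholders; this is a sketch of the documented API, not a claim about what the demo script does):

# Save the boosted (wrapped) model; save_model gathers the distributed weights
# and writes unwrapped, HuggingFace-style state-dict shards.
booster.save_model(model, "ckpt/opt-finetuned", shard=True, size_per_shard=1024)

# To evaluate later, either load the shards back through a booster built with
# the same plugin:
#     booster.load_model(model, "ckpt/opt-finetuned")
# or, if the directory also contains the HF config, load it with
#     OPTForCausalLM.from_pretrained("ckpt/opt-finetuned")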
@Edenzzzz Thanks for the reply. The code is already using booster.save_model, and the problem is still there when loading the finetuned model.
Is there any fix for this? To reproduce, train an OPT or llama2 model with plugin=hybrid_parallel, pp_size=4 and tp_size=4; I already provided the eval script and dataset info above.
Please let me know if any other information is needed.
Could you share the keys of the model saved using the hybrid parallel plugin?
@Edenzzzz I am not sure how to get the keys of the model. But one thing I noticed: the model only fails to save all keys when I use more than one node to finetune. Using a single node doesn't cause any missing-key issue.
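To get the keys, one option is to read the index file of the sharded checkpoint (or load each shard and collect its keys). A rough sketch, assuming the checkpoint was saved as sharded .bin files with a pytorch_model.bin.index.json index; adjust the file names to whatever your save directory actually contains:

import json
import os
import torch

ckpt_dir = "path/to/saved/model"  # placeholder

index_path = os.path.join(ckpt_dir, "pytorch_model.bin.index.json")
if os.path.exists(index_path):
    # The index maps every parameter key to the shard file that stores it.
    with open(index_path) as f:
        keys = sorted(json.load(f)["weight_map"].keys())
else:
    # Fall back to loading each shard and collecting its keys.
    keys = []
    for fname in sorted(os.listdir(ckpt_dir)):
        if fname.endswith(".bin"):
            shard = torch.load(os.path.join(ckpt_dir, fname), map_location="cpu")
            keys.extend(shard.keys())

print(len(keys), "keys")
for k in keys:
    print(k)

Comparing this list against the keys of a single-node checkpoint should show exactly which layers are missing.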
You can try lazy init as shown here and file a PR if it works.
https://github.com/hpcaitech/ColossalAI/blob/8e718a1421203e0f5607f477e1a998567c70d123/examples/language/llama/benchmark.py#L245
@Edenzzzz I tried adding LazyInit for the OPT model, but it doesn't work. Does LazyInit work differently for each model?
I have tested that it works (https://github.com/hpcaitech/ColossalAI/commit/8cc8f645cd1d971a3bef52f625b7881f17c6d22b).
@Edenzzzz I followed the steps from https://github.com/hpcaitech/ColossalAI/commit/8cc8f645cd1d971a3bef52f625b7881f17c6d22b and am getting the following error.
colossalai: 0.3.6, transformers: 4.33.0, docker: nvcr.io/nvidia/pytorch:23.02-py3
torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=x.x.x.x:12312 opt/o1.py --model_name_or_path /workspace/ColossalAI/opt-1.3b/ --output_path test1 --plugin gemini --batch_size 32 --learning_rate 0.00002 --weight_decay 0.01 --warmup_ratio 0.1 --max_length 512
File "opt/o1.py", line 213, in main
model, optimizer, _, dataloader, lr_scheduler = booster.boost(
File "/usr/local/lib/python3.8/dist-packages/colossalai/booster/booster.py", line 138, in boost
model, optimizer, _, dataloader, lr_scheduler = booster.boost(
File "/usr/local/lib/python3.8/dist-packages/colossalai/booster/booster.py", line 138, in boost
model, optimizer, criterion, dataloader, lr_scheduler = self.plugin.configure(
File "/usr/local/lib/python3.8/dist-packages/colossalai/booster/plugin/gemini_plugin.py", line 546, in configure
model, optimizer, criterion, dataloader, lr_scheduler = self.plugin.configure(
File "/usr/local/lib/python3.8/dist-packages/colossalai/booster/plugin/gemini_plugin.py", line 546, in configure
model = GeminiDDP(
File "/usr/local/lib/python3.8/dist-packages/colossalai/zero/gemini/gemini_ddp.py", line 101, in __init__
model = GeminiDDP(
File "/usr/local/lib/python3.8/dist-packages/colossalai/zero/gemini/gemini_ddp.py", line 101, in __init__
self.chunk_manager = init_chunk_manager(
File "/usr/local/lib/python3.8/dist-packages/colossalai/zero/gemini/chunk/utils.py", line 31, in init_chunk_manager
self.chunk_manager = init_chunk_manager(
File "/usr/local/lib/python3.8/dist-packages/colossalai/zero/gemini/chunk/utils.py", line 31, in init_chunk_manager
dist.barrier()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3221, in barrier
dist.barrier()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3221, in barrier
work = default_pg.barrier(opts=opts)
RuntimeErrorwork = default_pg.barrier(opts=opts):
[4] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer
RuntimeError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection
reset by peer
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3701) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.14.0a0+44dac51', 'console_scripts', 'torchrun')())
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
opt/o1.py FAILED
Let me know if you need any other info to reproduce from your end.
This doesn't seem to be caused by lazy init. You should make sure your network environment (e.g. NCCL_SOCKET_IFNAME) is set up correctly and upgrade to the newest ColossalAI.
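As a sketch of that environment check (eth0 is a placeholder for the NIC that connects your nodes; the variables must be set before the process group is created, i.e. before colossalai.launch_from_torch runs, or exported in the shell that launches torchrun):

import os

# Tell NCCL/Gloo which network interface to use for inter-node traffic.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # placeholder NIC name
os.environ.setdefault("GLOO_SOCKET_IFNAME", "eth0")
# Optional: verbose NCCL logging to help debug "Connection reset by peer" errors.
os.environ.setdefault("NCCL_DEBUG", "INFO")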