[BUG]: loading OPT 66B model - CPU runs out of memory
Is there an existing issue for this bug?
- [X] I have searched the existing issues
🐛 Describe the bug
I am trying to reproduce OPT-66B using 16xH100 (2 servers). Each server has 1000 GiB of CPU memory. When I run the OPT benchmark, the program crashes with the following error; watching CPU memory, it climbs to 924 GiB before the crash. How can I run the OPT-66B benchmark with these resources?
Error log:
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/usr/local/lib/python3.8/dist-packages/colossalai/initialize.py:48: UserWarning: `config` is deprecated and will be removed soon.
warnings.warn("`config` is deprecated and will be removed soon.")
[06/25/24 19:04:54] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.8/dist-packages/colossalai/initialize.py:67 launch
[06/25/24 19:04:55] INFO colossalai - colossalai - INFO: Distributed environment is initialized, world size: 16
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51974 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51975 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51976 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51977 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51978 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51980 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 51981 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 5 (pid: 51979) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.14.0a0+44dac51', 'console_scripts', 'torchrun')())
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
opt/opt_train_demo.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
Environment
Docker image: nvcr.io/nvidia/pytorch:23.02-py3
transformers: 4.33
colossalai: 0.3.6
You can try lazy init as shown here and file a PR if it works: https://github.com/hpcaitech/ColossalAI/blob/8e718a1421203e0f5607f477e1a998567c70d123/examples/language/llama/benchmark.py#L245
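For reference, a minimal sketch of what the lazy-init path could look like for the OPT demo. Assumptions: the plugin/booster setup comes from opt_train_demo.py, pretrained weights are loaded through booster.load_model after boost rather than eagerly via from_pretrained, and helper names such as get_current_device may differ between ColossalAI versions (the linked benchmark uses get_accelerator().get_current_device()).

from transformers import AutoConfig, OPTForCausalLM
from colossalai.lazy import LazyInitContext
from colossalai.utils import get_current_device

# Build the 66B model lazily: parameters are not materialized on the host here,
# so each rank does not hold a full copy of the model in CPU memory.
config = AutoConfig.from_pretrained("facebook/opt-66b")
with LazyInitContext(default_device=get_current_device()):
    model = OPTForCausalLM(config)

# ... create the plugin, optimizer and booster as in opt_train_demo.py ...
# model, optimizer, _, dataloader, lr_scheduler = booster.boost(model, optimizer, ...)
# After boost, load the pretrained weights through the booster so each rank only
# materializes its own shards:
# booster.load_model(model, "path/to/opt-66b/checkpoint")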
Thanks @Edenzzzz for the suggestion, I will try it. I also have one more question: during evaluation of OPT, the eval loss for a model trained with the hybrid_parallel plugin is 5x higher than with the gemini plugin, and it is like this for most of the OPT variants. Do you know why?
@Edenzzzz Further evaluating llama2, I see a similar pattern: the eval loss for a model fine-tuned with the hybrid_parallel plugin is 5x higher than with the other plugins.
Am I missing anything while evaluating a model fine-tuned with the Hybrid Parallel plugin?
Eval script
import argparse

import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM, GPTJForCausalLM, LlamaForCausalLM, OPTForCausalLM
import numpy as np
import evaluate

from dataset_util import create_dataset

parser = argparse.ArgumentParser()
parser.add_argument('--model_name', type=str, default="meta-llama/Llama-2-7b-hf")
parser.add_argument('--output_dir', type=str, required=True)
parser.add_argument('--batch_size', type=int, required=False, default=8)
parser.add_argument('--max_length', type=int, required=False, default=512)
parser.add_argument('--saved_model_path', type=str, default="")
args = parser.parse_args()

args.dataset_path = "yizhongw/self_instruct"
args.dataset_size = 49600
max_length = args.max_length
learning_rate = 0.00002


def prepare_dataset():
    tokenizer = AutoTokenizer.from_pretrained(
        args.model_name,
        padding_side="left",
        add_eos_token=True,
        add_bos_token=True,
    )
    tokenizer.pad_token = tokenizer.eos_token
    tokenized_train_dataset, tokenized_val_dataset, data_collator = create_dataset(
        dataset_name=args.dataset_path,
        tokenizer=tokenizer,
        max_length=args.max_length)
    return tokenizer, tokenized_train_dataset, tokenized_val_dataset, data_collator


tokenizer, tokenized_train_dataset, tokenized_val_dataset, data_collator = prepare_dataset()

print("Loading model")
kwargs = {}
if "llama" in args.model_name:
    model = LlamaForCausalLM.from_pretrained(args.saved_model_path, use_cache=False, low_cpu_mem_usage=False,
                                             torch_dtype=torch.bfloat16, **kwargs)
elif "opt" in args.model_name:
    model = OPTForCausalLM.from_pretrained(args.saved_model_path, use_cache=False, low_cpu_mem_usage=False,
                                           torch_dtype=torch.float16, **kwargs)
model.resize_token_embeddings(len(tokenizer))
print("Model loaded")

print("Preparing training arguments")
training_args = transformers.TrainingArguments(
    args.output_dir,
    logging_steps=1,
    label_names=["input_ids", "attention_mask"],
    push_to_hub=False,
    report_to="none",
    disable_tqdm=True,
    per_device_train_batch_size=args.batch_size,
    per_device_eval_batch_size=8,
    do_train=False,
    do_eval=True,
    evaluation_strategy="steps",
    eval_accumulation_steps=args.batch_size,
)

metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions[0], references=labels[0][0])


data_collator = transformers.DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)

trainer = transformers.Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.evaluate()
I didn't see a difference in training loss using examples/language/llama/benchmark.py. Which dataset and script did you use for training?
@Edenzzzz
I am using the training script.
I am using the yizhongw/self_instruct dataset.
Eval logs for the model trained with the Hybrid Parallel plugin, pp_size=4 and tp_size=4:
Loading checkpoint shards: 100%|██████████| 8/8 [00:04<00:00, 1.89it/s]
Some weights of LlamaForCausalLM were not initialized from the model checkpoint at /workspace/ColossalAI_old/examples/language/100_llama7b_models_627/hybrid_parallel_4_4_64_llama2-7b-hf/epoch0-step48/model/ and are newly initialized: ['model.layers.19.mlp.down_proj.weight', 'model.layers.16.post_attention_layernorm.weight', 'model.layers.31.self_attn.k_proj.weight', 'model.layers.28.mlp.up_proj.weight', ... every self_attn (q/k/v/o_proj), mlp (gate/up/down_proj), input_layernorm and post_attention_layernorm weight for layers 16-31 ..., 'model.norm.weight', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
You are resizing the embedding layer without providing a `pad_to_multiple_of` parameter. This means that the new embedding dimension will be 32000. This might induce some performance reduction as *Tensor Cores* will not be available. For more details about this, or help on choosing the correct value for resizing, refer to this guide: https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
Model loaded
Preparing training arguments
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
  warnings.warn('Was asked to gather along dimension 0, but all '
{'eval_loss': 11.167560577392578, 'eval_accuracy': 0.0, 'eval_runtime': 63.0477, 'eval_samples_per_second': 7.931, 'eval_steps_per_second': 0.127}
According to the log, it seems that the checkpoint is not loaded correctly.
@Edenzzzz Could you please take a look at the reply above? Thank you for your support.
Some weights of LlamaForCausalLM were not initialized from the model checkpoint
You might have forgotten to call model.unwrap before saving it, which causes key mismatches. Booster.boost adds a wrapper to the model.
@Edenzzzz I am using the script to finetune and save the model. Using the Gemini plugin doesn't give any warning about weight key mismatches during evaluation with the same script; it only happens when I use a model trained with plugin=Hybrid Parallel.
The save_sharded_model method in colossalai/checkpoint_io/hybrid_parallel_checkpoint_io.py already applies unwrap, so I am not sure why it's giving key mismatches.
It looks like training with the hybrid_parallel plugin doesn't restore the model correctly, does it?
You should use booster.save_model, which unwraps the model.
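For reference, a minimal save/load sketch with the Booster checkpoint API (paths are placeholders; this is a sketch of the documented API, not a claim about what the demo script does):

# Save the boosted (wrapped) model; save_model gathers the distributed weights
# and writes unwrapped, HuggingFace-style state-dict shards.
booster.save_model(model, "ckpt/opt-finetuned", shard=True, size_per_shard=1024)

# To evaluate later, either load the shards back through a booster built with
# the same plugin:
#     booster.load_model(model, "ckpt/opt-finetuned")
# or, if the directory also contains the HF config, load it with
#     OPTForCausalLM.from_pretrained("ckpt/opt-finetuned")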
@Edenzzzz Thanks for the reply. The code is already using booster.save_model, and the problem is still there when loading the finetuned model.
Is there any fix for this? To reproduce, train an OPT or llama2 model with plugin=hybrid_parallel, pp_size=4 and tp_size=4; I already provided the eval script and dataset info above.
Please let me know if any other information is needed.
Could you share the keys of the model saved using the hybrid parallel plugin?
@Edenzzzz I am not sure how to get the keys of the model. But one thing I noticed: the model only fails to save all keys when I use more than one node to finetune. Using a single node doesn't cause any missing-key issue.
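To get the keys, one option is to read the index file of the sharded checkpoint (or load each shard and collect its keys). A rough sketch, assuming the checkpoint was saved as sharded .bin files with a pytorch_model.bin.index.json index; adjust the file names to whatever your save directory actually contains:

import json
import os
import torch

ckpt_dir = "path/to/saved/model"  # placeholder

index_path = os.path.join(ckpt_dir, "pytorch_model.bin.index.json")
if os.path.exists(index_path):
    # The index maps every parameter key to the shard file that stores it.
    with open(index_path) as f:
        keys = sorted(json.load(f)["weight_map"].keys())
else:
    # Fall back to loading each shard and collecting its keys.
    keys = []
    for fname in sorted(os.listdir(ckpt_dir)):
        if fname.endswith(".bin"):
            shard = torch.load(os.path.join(ckpt_dir, fname), map_location="cpu")
            keys.extend(shard.keys())

print(len(keys), "keys")
for k in keys:
    print(k)

Comparing this list against the keys of a single-node checkpoint should show exactly which layers are missing.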
You can try lazy init as shown here and file a PR if it works.
https://github.com/hpcaitech/ColossalAI/blob/8e718a1421203e0f5607f477e1a998567c70d123/examples/language/llama/benchmark.py#L245
@Edenzzzz I tried adding LazyInit for the OPT model, but it doesn't work. Does LazyInit work differently for each model?
I have tested that it works (https://github.com/hpcaitech/ColossalAI/commit/8cc8f645cd1d971a3bef52f625b7881f17c6d22b).
@Edenzzzz I followed the steps from https://github.com/hpcaitech/ColossalAI/commit/8cc8f645cd1d971a3bef52f625b7881f17c6d22b and am getting the following error.
colossalai: 0.3.6, transformers: 4.33.0, docker: nvcr.io/nvidia/pytorch:23.02-py3
torchrun --nproc_per_node=8 --nnodes=1 --node_rank=0 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=x.x.x.x:12312 opt/o1.py --model_name_or_path /workspace/ColossalAI/opt-1.3b/ --output_path test1 --plugin gemini --batch_size 32 --learning_rate 0.00002 --weight_decay 0.01 --warmup_ratio 0.1 --max_length 512
File "opt/o1.py", line 213, in main
model, optimizer, _, dataloader, lr_scheduler = booster.boost(
File "/usr/local/lib/python3.8/dist-packages/colossalai/booster/booster.py", line 138, in boost
model, optimizer, _, dataloader, lr_scheduler = booster.boost(
File "/usr/local/lib/python3.8/dist-packages/colossalai/booster/booster.py", line 138, in boost
model, optimizer, criterion, dataloader, lr_scheduler = self.plugin.configure(
File "/usr/local/lib/python3.8/dist-packages/colossalai/booster/plugin/gemini_plugin.py", line 546, in configure
model, optimizer, criterion, dataloader, lr_scheduler = self.plugin.configure(
File "/usr/local/lib/python3.8/dist-packages/colossalai/booster/plugin/gemini_plugin.py", line 546, in configure
model = GeminiDDP(
File "/usr/local/lib/python3.8/dist-packages/colossalai/zero/gemini/gemini_ddp.py", line 101, in __init__
model = GeminiDDP(
File "/usr/local/lib/python3.8/dist-packages/colossalai/zero/gemini/gemini_ddp.py", line 101, in __init__
self.chunk_manager = init_chunk_manager(
File "/usr/local/lib/python3.8/dist-packages/colossalai/zero/gemini/chunk/utils.py", line 31, in init_chunk_manager
self.chunk_manager = init_chunk_manager(
File "/usr/local/lib/python3.8/dist-packages/colossalai/zero/gemini/chunk/utils.py", line 31, in init_chunk_manager
dist.barrier()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3221, in barrier
dist.barrier()
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 3221, in barrier
work = default_pg.barrier(opts=opts)
RuntimeErrorwork = default_pg.barrier(opts=opts):
[4] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection reset by peer
RuntimeError: [2] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Connection
reset by peer
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3701) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==1.14.0a0+44dac51', 'console_scripts', 'torchrun')())
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
opt/o1.py FAILED
Let me know if you need any other info to reproduce from your end.
This doesn't seem to be caused by lazy init. You should make sure your network environment (e.g. NCCL_SOCKET_IFNAME) is set up correctly and upgrade to the newest ColossalAI.
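As a sketch of that environment check (eth0 is a placeholder for the NIC that connects your nodes; the variables must be set before the process group is created, i.e. before colossalai.launch_from_torch runs, or exported in the shell that launches torchrun):

import os

# Tell NCCL/Gloo which network interface to use for inter-node traffic.
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # placeholder NIC name
os.environ.setdefault("GLOO_SOCKET_IFNAME", "eth0")
# Optional: verbose NCCL logging to help debug "Connection reset by peer" errors.
os.environ.setdefault("NCCL_DEBUG", "INFO")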