transformers
Shared tensors not correctly saved.
System Info
- transformers version: 4.36.0.dev0
- Platform: Linux-4.19.0-25-cloud-amd64-x86_64-with-glibc2.28
- Python version: 3.9.17
- Huggingface_hub version: 0.16.4
- Safetensors version: 0.3.2
- Accelerate version: 0.24.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1+cu118 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: 8*A100
- Using distributed or parallel set-up in script?: accelerate + deepspeed zero3
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I am fine-tuning Fuyu-8B and found that calling the model.save_pretrained method runs into an error after upgrading to 4.36.0.
The error shows:
Removed shared tensor {'language_model.model.layers.12.self_attn.dense.weight', 'language_model.model.layers.22.self_attn.k_layernorm.weight', 'language_model.model.layers.24.mlp.dense_h_to_4h.bias', 'language_model.model.layers.15.mlp.dense_h_to_4h.weight', 'language_model.model.layers.22.input_layernorm.weight', 'language_model.model.layers.25.self_attn.q_layernorm.weight', 'language_model.model.layers.8.self_attn.query_key_value.bias', 'language_model.model.layers.33.mlp.dense_4h_to_h.bias', 'language_model.model.layers.6.post_attention_layernorm.weight', 'language_model.model.layers.30.self_attn.query_key_value.weight', 'language_model.model.layers.5.self_attn.query_key_value.weight', 'language_model.model.layers.10.mlp.dense_h_to_4h.bias', 'language_model.model.layers.5.post_attention_layernorm.weight', 'language_model.model.layers.15.mlp.dense_4h_to_h.bias', 'language_model.model.layers.2.self_attn.query_key_value.bias', 'language_model.model.layers.4.input_layernorm.bias', 'language_model.model.layers.25.self_attn.k_layernorm.weight', 'language_model.model.layers.29.self_attn.query_key_value.weight', 'language_model.model.layers.13.self_attn.query_key_value.bias', 'language_model.lm_head.weight', 'language_model.model.layers.6.mlp.dense_h_to_4h.weight', 'language_model.model.layers.13.mlp.dense_4h_to_h.weight', 'language_model.model.layers.14.mlp.dense_h_to_4h.weight', 'language_model.model.layers.31.mlp.dense_h_to_4h.weight', 'language_model.model.layers.32.input_layernorm.weight', 'language_model.model.layers.19.mlp.dense_4h_to_h.bias', 'language_model.model.layers.24.self_attn.dense.bias', 'language_model.model.layers.5.self_attn.query_key_value.bias', 'language_model.model.layers.7.mlp.dense_4h_to_h.bias', 'language_model.model.layers.10.self_attn.query_key_value.bias', 'language_model.model.layers.18.mlp.dense_h_to_4h.weight', 'language_model.model.layers.29.post_attention_layernorm.bias', 'language_model.model.layers.11.self_attn.dense.weight', 'language_model.model.layers.28.self_attn.query_key_value.weight', 'language_model.model.layers.14.mlp.dense_4h_to_h.weight', 'language_model.model.layers.15.mlp.dense_4h_to_h.weight', 'language_model.model.layers.35.mlp.dense_4h_to_h.weight', 'language_model.model.layers.17.post_attention_layernorm.bias', 'language_model.model.layers.23.mlp.dense_h_to_4h.bias', 'language_model.model.layers.15.mlp.dense_h_to_4h.bias', 'language_model.model.final_layernorm.weight', 'language_model.model.layers.6.mlp.dense_4h_to_h.weight', 'language_model.model.layers.29.input_layernorm.weight', 'language_model.model.layers.13.self_attn.q_layernorm.bias', 'language_model.model.layers.6.self_attn.dense.weight', 'language_model.model.layers.22.self_attn.query_key_value.weight', 'language_model.model.layers.35.post_attention_layernorm.bias', 'language_model.model.layers.23.self_attn.dense.bias', 'language_model.model.layers.16.self_attn.k_layernorm.weight', 'language_model.model.layers.32.self_attn.dense.weight', 'language_model.model.layers.25.self_attn.dense.bias', 'language_model.model.layers.9.self_attn.query_key_value.bias', 'language_model.model.layers.25.self_attn.k_layernorm.bias', 'language_model.model.layers.3.mlp.dense_h_to_4h.weight', 'language_model.model.layers.21.self_attn.q_layernorm.weight', 'language_model.model.layers.32.post_attention_layernorm.bias', 'language_model.model.layers.33.self_attn.q_layernorm.weight', 'language_model.model.layers.2.post_attention_layernorm.bias', 'language_model.model.layers.20.mlp.dense_4h_to_h.bias', 
'language_model.model.layers.4.self_attn.k_layernorm.bias', 'language_model.model.layers.29.mlp.dense_4h_to_h.weight', 'language_model.model.layers.32.self_attn.dense.bias', 'language_model.model.layers.8.mlp.dense_h_to_4h.weight', 'language_model.model.layers.34.self_attn.query_key_value.bias', 'language_model.model.layers.35.self_attn.k_layernorm.bias', 'language_model.model.layers.4.post_attention_layernorm.bias', 'language_model.model.layers.28.mlp.dense_4h_to_h.bias', 'language_model.model.layers.8.self_attn.q_layernorm.bias', 'language_model.model.layers.32.self_attn.k_layernorm.weight', 'language_model.model.layers.28.self_attn.dense.weight', 'language_model.model.layers.31.mlp.dense_4h_to_h.bias', 'language_model.model.layers.0.mlp.dense_4h_to_h.weight', 'language_model.model.layers.11.mlp.dense_h_to_4h.weight', 'language_model.model.layers.29.mlp.dense_4h_to_h.bias', 'language_model.model.layers.19.mlp.dense_h_to_4h.weight', 'language_model.model.layers.12.post_attention_layernorm.weight', 'language_model.model.layers.7.self_attn.query_key_value.weight', 'language_model.model.layers.13.input_layernorm.weight', 'language_model.model.layers.31.mlp.dense_h_to_4h.bias', 'language_model.model.layers.0.self_attn.k_layernorm.bias', 'language_model.model.layers.34.self_attn.q_layernorm.bias', 'language_model.model.layers.1.self_attn.k_layernorm.weight', 'language_model.model.layers.35.self_attn.q_layernorm.weight', 'language_model.model.layers.29.self_attn.k_layernorm.bias', 'language_model.model.layers.34.mlp.dense_4h_to_h.weight', 'language_model.model.layers.30.mlp.dense_h_to_4h.bias', 'language_model.model.layers.0.input_layernorm.bias', 'language_model.model.layers.18.self_attn.query_key_value.weight', 'language_model.model.layers.1.mlp.dense_h_to_4h.bias', 'language_model.model.layers.26.mlp.dense_h_to_4h.weight', 'language_model.model.layers.8.post_attention_layernorm.weight', 'language_model.model.layers.18.self_attn.dense.bias', 'language_model.model.layers.30.mlp.dense_4h_to_h.bias', 'language_model.model.layers.7.mlp.dense_h_to_4h.bias', 'language_model.model.layers.31.self_attn.dense.weight', 'language_model.model.layers.9.self_attn.query_key_value.weight', 'language_model.model.layers.12.input_layernorm.bias', 'language_model.model.layers.14.self_attn.q_layernorm.weight', 'language_model.model.layers.28.self_attn.dense.bias', 'language_model.model.layers.6.self_attn.q_layernorm.bias', 'language_model.model.layers.30.self_attn.query_key_value.bias', 'language_model.model.layers.11.self_attn.q_layernorm.weight', 'language_model.model.layers.33.self_attn.dense.bias', 'language_model.model.layers.14.mlp.dense_h_to_4h.bias', 'language_model.model.layers.14.mlp.dense_4h_to_h.bias', 'language_model.model.layers.12.mlp.dense_h_to_4h.weight', 'language_model.model.layers.10.self_attn.dense.weight', 'language_model.model.layers.5.self_attn.k_layernorm.weight', 'language_model.model.layers.33.mlp.dense_h_to_4h.weight', 'language_model.model.layers.17.mlp.dense_4h_to_h.weight', 'language_model.model.layers.19.self_attn.dense.bias', 'language_model.model.layers.4.mlp.dense_4h_to_h.bias', 'language_model.model.layers.19.self_attn.query_key_value.weight', 'language_model.model.layers.8.input_layernorm.bias', 'language_model.model.layers.6.self_attn.k_layernorm.bias', 'language_model.model.layers.31.self_attn.dense.bias', 'language_model.model.layers.25.self_attn.query_key_value.bias', 'language_model.model.layers.34.self_attn.q_layernorm.weight', 
'language_model.model.layers.7.input_layernorm.bias', 'language_model.model.layers.2.self_attn.k_layernorm.bias', 'language_model.model.layers.29.self_attn.q_layernorm.bias', 'language_model.model.layers.16.self_attn.query_key_value.bias', 'language_model.model.layers.35.mlp.dense_h_to_4h.weight', 'language_model.model.layers.35.post_attention_layernorm.weight', 'language_model.model.layers.1.self_attn.dense.weight', 'language_model.model.layers.4.mlp.dense_h_to_4h.bias', 'language_model.model.layers.15.input_layernorm.bias', 'language_model.model.layers.4.post_attention_layernorm.weight', 'language_model.model.layers.14.input_layernorm.weight', 'language_model.model.layers.22.mlp.dense_4h_to_h.bias', 'language_model.model.layers.11.input_layernorm.weight', 'language_model.model.layers.27.self_attn.k_layernorm.bias', 'language_model.model.layers.18.mlp.dense_4h_to_h.bias', 'language_model.model.layers.25.mlp.dense_h_to_4h.bias', 'language_model.model.layers.32.input_layernorm.bias', 'language_model.model.layers.10.mlp.dense_h_to_4h.weight', 'language_model.model.layers.14.self_attn.k_layernorm.weight', 'language_model.model.layers.8.post_attention_layernorm.bias', 'language_model.model.layers.27.self_attn.dense.bias', 'language_model.model.layers.21.self_attn.k_layernorm.weight', 'language_model.model.layers.27.self_attn.q_layernorm.weight', 'language_model.model.layers.30.self_attn.dense.weight', 'language_model.model.layers.23.mlp.dense_4h_to_h.bias', 'language_model.model.layers.18.post_attention_layernorm.weight', 'language_model.model.layers.22.self_attn.q_layernorm.weight', 'language_model.model.layers.13.self_attn.dense.bias', 'language_model.model.layers.14.self_attn.query_key_value.bias', 'language_model.model.layers.10.self_attn.k_layernorm.bias', 'language_model.model.layers.34.input_layernorm.bias', 'language_model.model.layers.3.post_attention_layernorm.bias', 'language_model.model.layers.5.input_layernorm.weight', 'language_model.model.layers.8.self_attn.query_key_value.weight', 'language_model.model.layers.27.post_attention_layernorm.bias', 'language_model.model.layers.28.mlp.dense_h_to_4h.weight', 'language_model.model.layers.28.self_attn.q_layernorm.weight', 'language_model.model.layers.5.mlp.dense_4h_to_h.weight', 'language_model.model.layers.19.self_attn.dense.weight', 'language_model.model.layers.21.input_layernorm.weight', 'language_model.model.layers.14.post_attention_layernorm.bias', 'language_model.model.layers.35.self_attn.query_key_value.bias', 'language_model.model.layers.10.mlp.dense_4h_to_h.weight', 'language_model.model.layers.17.self_attn.q_layernorm.bias', 'language_model.model.layers.25.input_layernorm.bias', 'language_model.model.layers.34.self_attn.dense.weight', 'language_model.model.layers.34.input_layernorm.weight', 'language_model.model.layers.5.self_attn.k_layernorm.bias', 'language_model.model.layers.2.mlp.dense_4h_to_h.weight', 'language_model.model.layers.11.self_attn.dense.bias', 'language_model.model.layers.17.mlp.dense_4h_to_h.bias', 'language_model.model.layers.13.mlp.dense_4h_to_h.bias', 'language_model.model.layers.21.self_attn.query_key_value.weight', 'language_model.model.lay
207 Saved checkpoint at epoch 1.
I tried setting safe_serialization=False; the warning disappears, but the saved pytorch_model.bin is only 2MB, compared to around 18GB originally (using 4.35.0).
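As a side note, a quick way to see which parameters the serializer would consider "shared" is to group state_dict entries by their storage pointer. This is a rough diagnostic sketch of mine (the helper name is made up, not a transformers API); my guess, which may be wrong, is that under zero3 the partitioned placeholder parameters can all appear to share storage, which would be consistent with the tiny saved file:

import torch
from collections import defaultdict

def find_shared_parameters(model: torch.nn.Module):
    # Group parameter/buffer names by the data pointer of their underlying storage.
    # Any group with more than one name is a "shared tensor" candidate that
    # safetensors-based saving will keep only once.
    groups = defaultdict(list)
    for name, tensor in model.state_dict().items():
        groups[tensor.data_ptr()].append(name)
    return [names for names in groups.values() if len(names) > 1]

# Example: print(find_shared_parameters(model)) right before save_pretrained.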
Expected behavior
See above
Thanks for reporting! I can reproduce it; I'm working on a fix.
Hmmm actually, it seems like it was just a mistake on my end, I cannot reproduce after trying again.
If you load the fuyu model, save it, and reload it once again, do you have an error? Does it only happen after fine-tuning?
Ohh I see, you don't have it on transformers==4.36.0? I had this issue on two of my instances. Let me figure out the details.
I think the problem may come from wrapping the model with accelerator + deepspeed zero3?
The error happens in my script at this line; you can take a look at the model-specific configs if that provides more context!
https://github.com/Luodian/Otter/blob/ca69589b7e4475c9e87836de30e7fc91bbee74b6/pipeline/train/instruction_following.py#L523
Thanks for sharing! I'm trying to reproduce it.
But if reproducing it is difficult on your side, I can share more information and my best guess when I'm more available! For now, at least I can run all my code with 4.35.0. And I think this issue will also help warn other users so they can avoid it.
Understood, it's likely it indeed comes from the safe serialization then. Do you have a command I can run using Otter? I'd like to dive in and see what may fail, I see you have different ways of saving the checkpoint:
https://github.com/Luodian/Otter/blob/ca69589b7e4475c9e87836de30e7fc91bbee74b6/pipeline/train/train_utils.py#L229-L262
Here's a minimal one:
accelerate launch --config_file=./pipeline/accelerate_configs/accelerate_config_zero2.yaml \
--num_processes=1 \
pipeline/train/instruction_following.py \
--pretrained_model_name_or_path=adept/fuyu-8b \
--training_data_yaml=./Demo_Data.yaml \
--model_name=fuyu \
--instruction_format=fuyu \
--batch_size=1 \
--gradient_accumulation_steps=2 \
--num_epochs=3 \
--external_save_dir=./checkpoints \
--run_name=Fuyu_Save_Tester \
--wandb_project=Fuyu \
--workers=${WORKERS} \
--lr_scheduler=cosine \
--learning_rate=1e-5 \
--warmup_steps_ratio=0.03 \
--save_hf_model \
--max_seq_len=1024 \
--logging_steps=1000 \
--keep_symbols \
--save_ckpt_each_epoch \
--dynamic_resolution \
--with_task_description
The data can be configured in Demo_Data.yaml; the required files are the instruction.json file (available here) and the images.parquet file (available here).
Basically the error comes from --save_hf_model.
Ok, let me try this. Are you tying weights yourself that weren't originally tied in the fuyu model?
Sorry, what does this mean? I didn't try other models, and I don't think Fuyu has any special handling of the model saving process. It calls save_pretrained from modeling_utils.py, I guess?
Ok that works no problem :) I was just making sure you weren't tying some weights yourself within the model, as this might go wrong on reload.
I'm currently debugging your script, will report here.
Edit: handed it out to the fantastic @muellerzr
Oh yes, the model weights are directly from adept/fuyu-8b.
But we have our own implementation inside modeling_persimmon.py, which is the base LLM of Fuyu. It is mainly about throughput optimization (a 4x improvement) and the integration of flash attention and fused operators.
Could that account for the error? I think both of the following setups err with 4.36.0:
- One of my instances has flash attention, so it calls our version of modeling_persimmon.py.
- Another instance doesn't have flash attention, so it calls transformers' modeling_persimmon.py.
The logic is:
try:
    from .modeling_persimmon import PersimmonForCausalLM
    print("Using local PersimmonForCausalLM with Flash Attention")
except ImportError:
    from transformers import PersimmonForCausalLM
    print("Using transformers PersimmonForCausalLM without Flash Attention")
Working on trying to reproduce this :)
Successfully reproduced; a minimal repro is below:
import torch
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin, HfDeepSpeedConfig
from transformers import AutoModelForCausalLM
from transformers.modeling_utils import unwrap_model

transformers_config = HfDeepSpeedConfig({
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 2,
    "gradient_clipping": 1.0,
    "offload_optimizer_device": None,
    "offload_param_device": None,
    "zero3_init_flag": False,
    "zero_optimization": {
        "stage": 2,
    },
})

plugin = DeepSpeedPlugin(transformers_config)
accelerator = Accelerator(deepspeed_plugin=plugin)

model_name = "bert-base-cased"
model = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.Adam(model.parameters(), lr=1e-5)

model, opt = accelerator._prepare_deepspeed(model, opt)
state_dict = accelerator.get_state_dict(model)
model = unwrap_model(model)
model.save_pretrained(
    "testing_fuyu_8b",
    state_dict=state_dict,
    safe_serialization=True
)
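If it helps, one quick sanity check on the output of the repro above is to reload the saved safetensors file and compare its keys against the model's state_dict. This is just a sketch; it assumes the non-sharded default file name model.safetensors inside the output directory used above:

from safetensors import safe_open

with safe_open("testing_fuyu_8b/model.safetensors", framework="pt") as f:
    saved_keys = set(f.keys())

# Keys present on the model but absent from the checkpoint were dropped at save time.
missing = set(model.state_dict().keys()) - saved_keys
print(f"{len(missing)} keys missing from the saved checkpoint")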
@Luodian can you try again with num_processes > 1? I couldn't reproduce it.
I can only reproduce your main example here because currently Accelerate doesn't really support single-GPU deepspeed.
Sorry, I'm a little busy these days; I may report back later, but probably not very soon.
I have the same issue, "Removed shared tensor". Transformers 4.35.2, using deepspeed on 1 GPU. Following the comments here, I disabled deepspeed and now it is saving correctly.
I imagine if you are getting this error, you are running deepspeed on a 1-GPU machine.
But I truly had this issue when using 2 or 8 GPUs with deepspeed zero3.
Agree. I got the same issue when I just ran it on my 8-GPU instance with deepspeed. I even downgraded to 4.35.0 and still have the same issue.
Basically, my code saves a BERT module in one folder and saves the overall model in another folder. I hypothesize that when saving with safetensors, if it notices that you are saving duplicate weights and biases, it saves the full thing once, and when you try re-saving it, it removes the shared modules (to save disk space, I guess). In my case, it was removing all of my layers except for the tail Embedding layer.
Luckily for me, setting safe_serialization=False fixed it. I hope you can figure out how to fix yours too @Luodian.
By the way, in case it matters, I am using deepspeed zero stage 0, but the Trainer only began to use fp16 and gradient checkpointing and such once I passed the deepspeed config (even though it is stage 0).
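To illustrate the hypothesis above, here is a minimal sketch (file and tensor names are made up) of how safetensors reacts when entries of a state dict alias the same storage. As far as I understand, it refuses to write aliased tensors directly, which is why save_pretrained drops duplicates first and logs the "Removed shared tensor" message:

import torch
from safetensors.torch import save_file

# Two entries backed by the same storage, e.g. tied embeddings and an lm_head.
shared = torch.randn(4, 4)
state_dict = {"embed.weight": shared, "lm_head.weight": shared}

try:
    save_file(state_dict, "tied.safetensors")
except RuntimeError as err:
    # safetensors rejects aliased storage rather than silently duplicating it on disk,
    # so the caller has to decide which copy to keep before saving.
    print(err)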
Yes, using 4.35.1 with safe_serialization=False solved my issue. I'm also going to stay on this version until the issue is fully addressed (in deepspeed zero 0/1/2/3, and multi-GPU).
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Encountered the error while using:
accelerate=0.25.0
transformers=4.36.2
Single gpu, not using deepspeed. Accelerate config:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
dynamo_config:
  dynamo_backend: INDUCTOR
gpu_ids: '0'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
When calling accelerator.save_state(dir) to save a flan-t5-small model, I get:
Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading
and then when I reload the model with accelerator.load_state(dir), I get:
RuntimeError: Error(s) in loading state_dict for T5ForConditionalGeneration:
Missing key(s) in state_dict: "encoder.embed_tokens.weight", "decoder.embed_tokens.weight".
Calling accelerator.save_state(dir, safe_serialization=False) works, but doesn't solve the underlying problem. Calling accelerator.save_state(dir) and then accelerator.load_state(dir) shouldn't throw an error. Why is safe_serialization removing these two shared tensors? I'm not sure what the best solution is, but this should be handled automatically.
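For reference, a minimal sketch of the failing round trip described above, assuming a plain single-GPU setup (the checkpoint directory name is illustrative, and the bf16/dynamo settings from my config are omitted):

import torch
from accelerate import Accelerator
from transformers import T5ForConditionalGeneration

accelerator = Accelerator()
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)

# Saving warns: Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'}
accelerator.save_state("ckpt")
# Reloading then fails with the missing encoder/decoder embed_tokens keys
accelerator.load_state("ckpt")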
Hi @GabPrato,
I had the same issue with Accelerate on a single GPU. Using safe_serialization=False in accelerator.save_state() resolved it.
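A sketch of that workaround, assuming an existing Accelerator instance and an illustrative directory name:

# Fall back to torch's pickle-based checkpoint files instead of safetensors
accelerator.save_state("ckpt", safe_serialization=False)
accelerator.load_state("ckpt")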
I have the same issue.
cc @muellerzr @pacman100
Hi all, could you please explain more about how you're using Accelerator.save_state() here? We don't expose that part of the API in the Trainer, so how it is being called could be the root of the issue (as the earlier error now fully and completely passes).
Please also share full and complete code.
I tried with the latest version of every package, and the reproducer runs for me.
Happy to help whenever there's a reproducer!
I ran the same code as in the following issue and hit the same problem: https://github.com/huggingface/trl/issues/1121
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.