transformers
Shared tensors not correctly saved.
System Info
- transformers version: 4.36.0.dev0
- Platform: Linux-4.19.0-25-cloud-amd64-x86_64-with-glibc2.28
- Python version: 3.9.17
- Huggingface_hub version: 0.16.4
- Safetensors version: 0.3.2
- Accelerate version: 0.24.1
- Accelerate config: not found
- PyTorch version (GPU?): 2.0.1+cu118 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: 8*A100
- Using distributed or parallel set-up in script?: accelerate + deepspeed zero3
Who can help?
No response
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
I am fine-tuning Fuyu-8B and found that calling the model.save_pretrained method runs into an error after upgrading to 4.36.0.
The error shows:
Removed shared tensor {'language_model.model.layers.12.self_attn.dense.weight', 'language_model.model.layers.22.self_attn.k_layernorm.weight', 'language_model.model.layers.24.mlp.dense_h_to_4h.bias', 'language_model.model.layers.15.mlp.dense_h_to_4h.weight', 'language_model.model.layers.22.input_layernorm.weight', 'language_model.model.layers.25.self_attn.q_layernorm.weight', 'language_model.model.layers.8.self_attn.query_key_value.bias', 'language_model.model.layers.33.mlp.dense_4h_to_h.bias', 'language_model.model.layers.6.post_attention_layernorm.weight', 'language_model.model.layers.30.self_attn.query_key_value.weight', 'language_model.model.layers.5.self_attn.query_key_value.weight', 'language_model.model.layers.10.mlp.dense_h_to_4h.bias', 'language_model.model.layers.5.post_attention_layernorm.weight', 'language_model.model.layers.15.mlp.dense_4h_to_h.bias', 'language_model.model.layers.2.self_attn.query_key_value.bias', 'language_model.model.layers.4.input_layernorm.bias', 'language_model.model.layers.25.self_attn.k_layernorm.weight', 'language_model.model.layers.29.self_attn.query_key_value.weight', 'language_model.model.layers.13.self_attn.query_key_value.bias', 'language_model.lm_head.weight', 'language_model.model.layers.6.mlp.dense_h_to_4h.weight', 'language_model.model.layers.13.mlp.dense_4h_to_h.weight', 'language_model.model.layers.14.mlp.dense_h_to_4h.weight', 'language_model.model.layers.31.mlp.dense_h_to_4h.weight', 'language_model.model.layers.32.input_layernorm.weight', 'language_model.model.layers.19.mlp.dense_4h_to_h.bias', 'language_model.model.layers.24.self_attn.dense.bias', 'language_model.model.layers.5.self_attn.query_key_value.bias', 'language_model.model.layers.7.mlp.dense_4h_to_h.bias', 'language_model.model.layers.10.self_attn.query_key_value.bias', 'language_model.model.layers.18.mlp.dense_h_to_4h.weight', 'language_model.model.layers.29.post_attention_layernorm.bias', 'language_model.model.layers.11.self_attn.dense.weight', 'language_model.model.layers.28.self_attn.query_key_value.weight', 'language_model.model.layers.14.mlp.dense_4h_to_h.weight', 'language_model.model.layers.15.mlp.dense_4h_to_h.weight', 'language_model.model.layers.35.mlp.dense_4h_to_h.weight', 'language_model.model.layers.17.post_attention_layernorm.bias', 'language_model.model.layers.23.mlp.dense_h_to_4h.bias', 'language_model.model.layers.15.mlp.dense_h_to_4h.bias', 'language_model.model.final_layernorm.weight', 'language_model.model.layers.6.mlp.dense_4h_to_h.weight', 'language_model.model.layers.29.input_layernorm.weight', 'language_model.model.layers.13.self_attn.q_layernorm.bias', 'language_model.model.layers.6.self_attn.dense.weight', 'language_model.model.layers.22.self_attn.query_key_value.weight', 'language_model.model.layers.35.post_attention_layernorm.bias', 'language_model.model.layers.23.self_attn.dense.bias', 'language_model.model.layers.16.self_attn.k_layernorm.weight', 'language_model.model.layers.32.self_attn.dense.weight', 'language_model.model.layers.25.self_attn.dense.bias', 'language_model.model.layers.9.self_attn.query_key_value.bias', 'language_model.model.layers.25.self_attn.k_layernorm.bias', 'language_model.model.layers.3.mlp.dense_h_to_4h.weight', 'language_model.model.layers.21.self_attn.q_layernorm.weight', 'language_model.model.layers.32.post_attention_layernorm.bias', 'language_model.model.layers.33.self_attn.q_layernorm.weight', 'language_model.model.layers.2.post_attention_layernorm.bias', 'language_model.model.layers.20.mlp.dense_4h_to_h.bias', 
'language_model.model.layers.4.self_attn.k_layernorm.bias', 'language_model.model.layers.29.mlp.dense_4h_to_h.weight', 'language_model.model.layers.32.self_attn.dense.bias', 'language_model.model.layers.8.mlp.dense_h_to_4h.weight', 'language_model.model.layers.34.self_attn.query_key_value.bias', 'language_model.model.layers.35.self_attn.k_layernorm.bias', 'language_model.model.layers.4.post_attention_layernorm.bias', 'language_model.model.layers.28.mlp.dense_4h_to_h.bias', 'language_model.model.layers.8.self_attn.q_layernorm.bias', 'language_model.model.layers.32.self_attn.k_layernorm.weight', 'language_model.model.layers.28.self_attn.dense.weight', 'language_model.model.layers.31.mlp.dense_4h_to_h.bias', 'language_model.model.layers.0.mlp.dense_4h_to_h.weight', 'language_model.model.layers.11.mlp.dense_h_to_4h.weight', 'language_model.model.layers.29.mlp.dense_4h_to_h.bias', 'language_model.model.layers.19.mlp.dense_h_to_4h.weight', 'language_model.model.layers.12.post_attention_layernorm.weight', 'language_model.model.layers.7.self_attn.query_key_value.weight', 'language_model.model.layers.13.input_layernorm.weight', 'language_model.model.layers.31.mlp.dense_h_to_4h.bias', 'language_model.model.layers.0.self_attn.k_layernorm.bias', 'language_model.model.layers.34.self_attn.q_layernorm.bias', 'language_model.model.layers.1.self_attn.k_layernorm.weight', 'language_model.model.layers.35.self_attn.q_layernorm.weight', 'language_model.model.layers.29.self_attn.k_layernorm.bias', 'language_model.model.layers.34.mlp.dense_4h_to_h.weight', 'language_model.model.layers.30.mlp.dense_h_to_4h.bias', 'language_model.model.layers.0.input_layernorm.bias', 'language_model.model.layers.18.self_attn.query_key_value.weight', 'language_model.model.layers.1.mlp.dense_h_to_4h.bias', 'language_model.model.layers.26.mlp.dense_h_to_4h.weight', 'language_model.model.layers.8.post_attention_layernorm.weight', 'language_model.model.layers.18.self_attn.dense.bias', 'language_model.model.layers.30.mlp.dense_4h_to_h.bias', 'language_model.model.layers.7.mlp.dense_h_to_4h.bias', 'language_model.model.layers.31.self_attn.dense.weight', 'language_model.model.layers.9.self_attn.query_key_value.weight', 'language_model.model.layers.12.input_layernorm.bias', 'language_model.model.layers.14.self_attn.q_layernorm.weight', 'language_model.model.layers.28.self_attn.dense.bias', 'language_model.model.layers.6.self_attn.q_layernorm.bias', 'language_model.model.layers.30.self_attn.query_key_value.bias', 'language_model.model.layers.11.self_attn.q_layernorm.weight', 'language_model.model.layers.33.self_attn.dense.bias', 'language_model.model.layers.14.mlp.dense_h_to_4h.bias', 'language_model.model.layers.14.mlp.dense_4h_to_h.bias', 'language_model.model.layers.12.mlp.dense_h_to_4h.weight', 'language_model.model.layers.10.self_attn.dense.weight', 'language_model.model.layers.5.self_attn.k_layernorm.weight', 'language_model.model.layers.33.mlp.dense_h_to_4h.weight', 'language_model.model.layers.17.mlp.dense_4h_to_h.weight', 'language_model.model.layers.19.self_attn.dense.bias', 'language_model.model.layers.4.mlp.dense_4h_to_h.bias', 'language_model.model.layers.19.self_attn.query_key_value.weight', 'language_model.model.layers.8.input_layernorm.bias', 'language_model.model.layers.6.self_attn.k_layernorm.bias', 'language_model.model.layers.31.self_attn.dense.bias', 'language_model.model.layers.25.self_attn.query_key_value.bias', 'language_model.model.layers.34.self_attn.q_layernorm.weight', 
'language_model.model.layers.7.input_layernorm.bias', 'language_model.model.layers.2.self_attn.k_layernorm.bias', 'language_model.model.layers.29.self_attn.q_layernorm.bias', 'language_model.model.layers.16.self_attn.query_key_value.bias', 'language_model.model.layers.35.mlp.dense_h_to_4h.weight', 'language_model.model.layers.35.post_attention_layernorm.weight', 'language_model.model.layers.1.self_attn.dense.weight', 'language_model.model.layers.4.mlp.dense_h_to_4h.bias', 'language_model.model.layers.15.input_layernorm.bias', 'language_model.model.layers.4.post_attention_layernorm.weight', 'language_model.model.layers.14.input_layernorm.weight', 'language_model.model.layers.22.mlp.dense_4h_to_h.bias', 'language_model.model.layers.11.input_layernorm.weight', 'language_model.model.layers.27.self_attn.k_layernorm.bias', 'language_model.model.layers.18.mlp.dense_4h_to_h.bias', 'language_model.model.layers.25.mlp.dense_h_to_4h.bias', 'language_model.model.layers.32.input_layernorm.bias', 'language_model.model.layers.10.mlp.dense_h_to_4h.weight', 'language_model.model.layers.14.self_attn.k_layernorm.weight', 'language_model.model.layers.8.post_attention_layernorm.bias', 'language_model.model.layers.27.self_attn.dense.bias', 'language_model.model.layers.21.self_attn.k_layernorm.weight', 'language_model.model.layers.27.self_attn.q_layernorm.weight', 'language_model.model.layers.30.self_attn.dense.weight', 'language_model.model.layers.23.mlp.dense_4h_to_h.bias', 'language_model.model.layers.18.post_attention_layernorm.weight', 'language_model.model.layers.22.self_attn.q_layernorm.weight', 'language_model.model.layers.13.self_attn.dense.bias', 'language_model.model.layers.14.self_attn.query_key_value.bias', 'language_model.model.layers.10.self_attn.k_layernorm.bias', 'language_model.model.layers.34.input_layernorm.bias', 'language_model.model.layers.3.post_attention_layernorm.bias', 'language_model.model.layers.5.input_layernorm.weight', 'language_model.model.layers.8.self_attn.query_key_value.weight', 'language_model.model.layers.27.post_attention_layernorm.bias', 'language_model.model.layers.28.mlp.dense_h_to_4h.weight', 'language_model.model.layers.28.self_attn.q_layernorm.weight', 'language_model.model.layers.5.mlp.dense_4h_to_h.weight', 'language_model.model.layers.19.self_attn.dense.weight', 'language_model.model.layers.21.input_layernorm.weight', 'language_model.model.layers.14.post_attention_layernorm.bias', 'language_model.model.layers.35.self_attn.query_key_value.bias', 'language_model.model.layers.10.mlp.dense_4h_to_h.weight', 'language_model.model.layers.17.self_attn.q_layernorm.bias', 'language_model.model.layers.25.input_layernorm.bias', 'language_model.model.layers.34.self_attn.dense.weight', 'language_model.model.layers.34.input_layernorm.weight', 'language_model.model.layers.5.self_attn.k_layernorm.bias', 'language_model.model.layers.2.mlp.dense_4h_to_h.weight', 'language_model.model.layers.11.self_attn.dense.bias', 'language_model.model.layers.17.mlp.dense_4h_to_h.bias', 'language_model.model.layers.13.mlp.dense_4h_to_h.bias', 'language_model.model.layers.21.self_attn.query_key_value.weight', 'language_model.model.lay
207 Saved checkpoint at epoch 1.
I tried setting safe_serialization=False; the warning disappears, but the saved pytorch_model.bin is only 2MB, compared to around 18GB originally (using 4.35.0).
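As a side note, a quick way to see which parameters the serializer would consider "shared" is to group state_dict entries by their storage pointer. This is a rough diagnostic sketch of mine (the helper name is made up, not a transformers API); my guess, which may be wrong, is that under zero3 the partitioned placeholder parameters can all appear to share storage, which would be consistent with the tiny saved file:

import torch
from collections import defaultdict

def find_shared_parameters(model: torch.nn.Module):
    # Group parameter/buffer names by the data pointer of their underlying storage.
    # Any group with more than one name is a "shared tensor" candidate that
    # safetensors-based saving will keep only once.
    groups = defaultdict(list)
    for name, tensor in model.state_dict().items():
        groups[tensor.data_ptr()].append(name)
    return [names for names in groups.values() if len(names) > 1]

# Example: print(find_shared_parameters(model)) right before save_pretrained.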
Expected behavior
See above
Thanks for reporting! I can reproduce it; I'm working on a fix.
Hmmm actually, it seems like it was just a mistake on my end, I cannot reproduce after trying again.
If you load the fuyu model, save it, and reload it once again, do you have an error? Does it only happen after fine-tuning?
Ohh I see, you don't have it on transformers==4.36.0? I had this issue on two of my instances. Let me figure out the details.
I think the problem may come from wrapping the model with accelerator + deepspeed zero3?
The error happens in my script at this line; you can take a look at the model-specific configs if that provides more context!
https://github.com/Luodian/Otter/blob/ca69589b7e4475c9e87836de30e7fc91bbee74b6/pipeline/train/instruction_following.py#L523
Thanks for sharing! I'm trying to reproduce it.
But if reproducing it is difficult on your side, I can share more information and my best guess when I'm more available! For now, at least I can run all my code with 4.35.0. And I think this issue will also help warn other users so they can avoid it.
Understood, it's likely it indeed comes from the safe serialization then. Do you have a command I can run using Otter? I'd like to dive in and see what may fail, I see you have different ways of saving the checkpoint:
https://github.com/Luodian/Otter/blob/ca69589b7e4475c9e87836de30e7fc91bbee74b6/pipeline/train/train_utils.py#L229-L262
Here's a minimal one:
accelerate launch --config_file=./pipeline/accelerate_configs/accelerate_config_zero2.yaml \
--num_processes=1 \
pipeline/train/instruction_following.py \
--pretrained_model_name_or_path=adept/fuyu-8b \
--training_data_yaml=./Demo_Data.yaml \
--model_name=fuyu \
--instruction_format=fuyu \
--batch_size=1 \
--gradient_accumulation_steps=2 \
--num_epochs=3 \
--external_save_dir=./checkpoints \
--run_name=Fuyu_Save_Tester \
--wandb_project=Fuyu \
--workers=${WORKERS} \
--lr_scheduler=cosine \
--learning_rate=1e-5 \
--warmup_steps_ratio=0.03 \
--save_hf_model \
--max_seq_len=1024 \
--logging_steps=1000 \
--keep_symbols \
--save_ckpt_each_epoch \
--dynamic_resolution \
--with_task_description
The data can be configured in Demo_Data.yaml; the required files are the instruction.json file (available here) and the images.parquet file (available here).
Basically the error comes from --save_hf_model.
Ok, let me try this. Are you tying weights yourself that weren't originally tied in the fuyu model?
Sorry, what does this mean? I didn't try other models, and I don't think Fuyu has any special handling of the model saving process. It calls save_pretrained from modeling_utils.py, I guess?
Ok that works no problem :) I was just making sure you weren't tying some weights yourself within the model, as this might go wrong on reload.
I'm currently debugging your script, will report here.
Edit: handed it out to the fantastic @muellerzr
Oh yes, the model weights are directly from adept/fuyu-8b.
But we have our own implementation inside modeling_persimmon.py, which is the base LLM of Fuyu. It is mainly about throughput optimization (a 4x improvement) and the integration of flash attention and fused operators.
Could that account for the error? I think both of the following setups err with 4.36.0:
- One of my instances has flash attention, so it calls our version of modeling_persimmon.py.
- Another instance doesn't have flash attention, so it calls transformers' modeling_persimmon.py.
The logic is:
try:
    from .modeling_persimmon import PersimmonForCausalLM
    print("Using local PersimmonForCausalLM with Flash Attention")
except ImportError:
    from transformers import PersimmonForCausalLM
    print("Using transformers PersimmonForCausalLM without Flash Attention")
Working on trying to reproduce this :)
Successfully reproduced; a minimal repro is below:
import torch
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin, HfDeepSpeedConfig
from transformers import AutoModelForCausalLM
from transformers.modeling_utils import unwrap_model

transformers_config = HfDeepSpeedConfig({
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 2,
    "gradient_clipping": 1.0,
    "offload_optimizer_device": None,
    "offload_param_device": None,
    "zero3_init_flag": False,
    "zero_optimization": {
        "stage": 2,
    },
})

plugin = DeepSpeedPlugin(transformers_config)
accelerator = Accelerator(deepspeed_plugin=plugin)

model_name = "bert-base-cased"
model = AutoModelForCausalLM.from_pretrained(model_name)
opt = torch.optim.Adam(model.parameters(), lr=1e-5)

model, opt = accelerator._prepare_deepspeed(model, opt)
state_dict = accelerator.get_state_dict(model)
model = unwrap_model(model)
model.save_pretrained(
    "testing_fuyu_8b",
    state_dict=state_dict,
    safe_serialization=True
)
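If it helps, one quick sanity check on the output of the repro above is to reload the saved safetensors file and compare its keys against the model's state_dict. This is just a sketch; it assumes the non-sharded default file name model.safetensors inside the output directory used above:

from safetensors import safe_open

with safe_open("testing_fuyu_8b/model.safetensors", framework="pt") as f:
    saved_keys = set(f.keys())

# Keys present on the model but absent from the checkpoint were dropped at save time.
missing = set(model.state_dict().keys()) - saved_keys
print(f"{len(missing)} keys missing from the saved checkpoint")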
@Luodian can you try again with num_processes > 1? I couldn't reproduce it.
I can only reproduce your main example here because currently Accelerate doesn't really support single-GPU deepspeed.
Sorry, I'm a little busy these days; I may report back later, but probably not very soon.
I have the same issue, "Removed shared tensor". Transformers 4.35.2, using deepspeed on 1 GPU. Following the comments here, I disabled deepspeed and now it is saving correctly.
I imagine if you are getting this error, you are running deepspeed on a 1-GPU machine.
But I truly had this issue when using 2 or 8 GPUs with deepspeed zero3.
Agree. I got the same issue when I just ran it on my 8-GPU instance with deepspeed. I even downgraded to 4.35.0 and still have the same issue.
Basically, my code saves a BERT module in one folder and saves the overall model in another folder. I hypothesize that when saving with safetensors, if it notices that you are saving duplicate weights and biases, it saves the full thing once, and when you try re-saving it, it removes the shared modules (to save disk space, I guess). In my case, it was removing all of my layers except for the tail Embedding layer.
Luckily for me, setting safe_serialization=False fixed it. I hope you can figure out how to fix yours too @Luodian.
By the way, in case it matters, I am using deepspeed zero stage 0, but the Trainer only began to use fp16 and gradient checkpointing and such once I passed the deepspeed config (even though it is stage 0).
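To illustrate the hypothesis above, here is a minimal sketch (file and tensor names are made up) of how safetensors reacts when entries of a state dict alias the same storage. As far as I understand, it refuses to write aliased tensors directly, which is why save_pretrained drops duplicates first and logs the "Removed shared tensor" message:

import torch
from safetensors.torch import save_file

# Two entries backed by the same storage, e.g. tied embeddings and an lm_head.
shared = torch.randn(4, 4)
state_dict = {"embed.weight": shared, "lm_head.weight": shared}

try:
    save_file(state_dict, "tied.safetensors")
except RuntimeError as err:
    # safetensors rejects aliased storage rather than silently duplicating it on disk,
    # so the caller has to decide which copy to keep before saving.
    print(err)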
Yes, using 4.35.1 with safe_serialization=False solved my issue. I'm also going to stay on this version until the issue is fully addressed (in deepspeed zero 0/1/2/3, and multi-GPU).
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Encountered the error while using:
accelerate=0.25.0
transformers=4.36.2
Single gpu, not using deepspeed. Accelerate config:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
dynamo_config:
  dynamo_backend: INDUCTOR
gpu_ids: '0'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
When calling accelerator.save_state(dir) to save a flan-t5-small model, I get:
Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading
and then when I reload the model with accelerator.load_state(dir), I get:
RuntimeError: Error(s) in loading state_dict for T5ForConditionalGeneration:
Missing key(s) in state_dict: "encoder.embed_tokens.weight", "decoder.embed_tokens.weight".
Calling accelerator.save_state(dir, safe_serialization=False) works, but doesn't solve the underlying problem. Calling accelerator.save_state(dir) and then accelerator.load_state(dir) shouldn't throw an error. Why is safe_serialization removing these two shared tensors? I'm not sure what the best solution is, but this should be handled automatically.
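For reference, a minimal sketch of the failing round trip described above, assuming a plain single-GPU setup (the checkpoint directory name is illustrative, and the bf16/dynamo settings from my config are omitted):

import torch
from accelerate import Accelerator
from transformers import T5ForConditionalGeneration

accelerator = Accelerator()
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = accelerator.prepare(model, optimizer)

# Saving warns: Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'}
accelerator.save_state("ckpt")
# Reloading then fails with the missing encoder/decoder embed_tokens keys
accelerator.load_state("ckpt")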
Hi @GabPrato,
I had the same issue with Accelerate on a single GPU. Using safe_serialization=False in accelerator.save_state() resolved it.
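A sketch of that workaround, assuming an existing Accelerator instance and an illustrative directory name:

# Fall back to torch's pickle-based checkpoint files instead of safetensors
accelerator.save_state("ckpt", safe_serialization=False)
accelerator.load_state("ckpt")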
I have the same issue.
cc @muellerzr @pacman100
Hi all, could you please explain more about how you're using Accelerator.save_state() here? We don't expose that part of the API in the Trainer, so how it is being called could be the root of the issue (as the earlier error now fully and completely passes).
Please also share full and complete code.
I tried with the latest version of every package, and the reproducer runs for me.
Happy to help whenever there's a reproducer!
I ran the same code as in the following issue and hit the same problem: https://github.com/huggingface/trl/issues/1121
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.