
[BUG] Loading auto train model gets size mismatch on AutoModelForCausalLM

Open BRM10213 opened this issue 1 year ago • 12 comments

Prerequisites

  • [X] I have read the documentation.
  • [X] I have checked other issues for similar problems.

Backend

Local

Interface Used

CLI

CLI Command

autotrain llm \
--train \
--project-name "$PROJECT_NAME" \
--model "$MODEL_NAME" \
--data-path "$DATA_PATH" \
--text_column "$TEXT_COLUMN" \
--use-peft \
--quantization "$QUANTIZATION" \
--lr "$LEARNING_RATE" \
--train-batch-size "$BATCH_SIZE" \
--epochs "$EPOCHS" \
--trainer "$TRAINER" \
--model_max_length "$MAX_LENGTH" \
--block_size "$BLOCK_SIZE" \
> training.log 2>&1 &
Python script used to load the trained model:

import torch

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/opt/huggingface/hub/CodeLlama-34b-Instruct-001"

tokenizer = AutoTokenizer.from_pretrained(model_path)

from_pretrained_kwargs = {
        'torch_dtype': torch.float32,
        'revision': 'main'
}
model = AutoModelForCausalLM.from_pretrained(
                model_path,
                low_cpu_mem_usage=True,
                trust_remote_code=True,
                **from_pretrained_kwargs,
)

input_text = "Health benefits of regular exercise"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
output = model.generate(input_ids)
predicted_text = tokenizer.decode(output[0], skip_special_tokens=False)
print(predicted_text)

UI Screenshots & Parameters

No response

Error Logs

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████| 7/7 [00:06<00:00, 1.09it/s]
Traceback (most recent call last):
  File "/home/azureuser/test_model.py", line 17, in <module>
    model = AutoModelForCausalLM.from_pretrained(model_path)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/hf-env/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/hf-env/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3933, in from_pretrained
    model.load_adapter(
  File "/root/anaconda3/envs/hf-env/lib/python3.11/site-packages/transformers/integrations/peft.py", line 206, in load_adapter
    incompatible_keys = set_peft_model_state_dict(self, processed_adapter_state_dict, adapter_name)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/hf-env/lib/python3.11/site-packages/peft/utils/save_and_load.py", line 241, in set_peft_model_state_dict
    load_result = model.load_state_dict(peft_model_state_dict, strict=False)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/anaconda3/envs/hf-env/lib/python3.11/site-packages/torch/nn/modules/module.py", line 2152, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
	size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([32004, 8192]) from checkpoint, the shape in current model is torch.Size([32000, 8192]).
	size mismatch for lm_head.weight: copying a param with shape torch.Size([32004, 8192]) from checkpoint, the shape in current model is torch.Size([32000, 8192]).

Additional Information

After training the model with autotrain, I try to load it with AutoModelForCausalLM, but it throws an error because the dimensions of the trained checkpoint differ from those of the base model.

BRM10213 avatar Feb 16 '24 06:02 BRM10213

seems like a transformers issue. could you merge the model and then try the same code? here is the code for merging:

import logging

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

logger = logging.getLogger(__name__)  # stand-in for autotrain's internal logger


def merge_adapter(base_model_path, target_model_path, adapter_path):
    logger.info("Loading adapter...")
    model = AutoModelForCausalLM.from_pretrained(
        base_model_path,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    )

    tokenizer = AutoTokenizer.from_pretrained(
        target_model_path,
        trust_remote_code=True,
    )
    model.resize_token_embeddings(len(tokenizer))

    model = PeftModel.from_pretrained(model, adapter_path)
    model = model.merge_and_unload()

    logger.info("Saving target model...")
    model.save_pretrained(target_model_path)
    tokenizer.save_pretrained(target_model_path)

requirements: peft==0.8.2 transformers==4.37.0
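
for example, a possible invocation (a sketch; paths are placeholders, and the autotrain project directory is assumed to already contain the trained tokenizer and adapter files):

# the merged weights and tokenizer are written back to target_model_path
merge_adapter(
    base_model_path="codellama/CodeLlama-34b-Instruct-hf",
    target_model_path="/opt/huggingface/hub/CodeLlama-34b-Instruct-001",
    adapter_path="/opt/huggingface/hub/CodeLlama-34b-Instruct-001",
)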

abhishekkrthakur avatar Feb 16 '24 08:02 abhishekkrthakur

Thank you for your response @abhishekkrthakur

I used your merge_adapter on my fine-tuned model and it created these new files:

model-00001-of-00014.safetensors
model-00002-of-00014.safetensors
....
model-00014-of-00014.safetensors
tokenizer_config.json
model.safetensors.index.json
special_tokens_map.json
added_tokens.json
tokenizer.json

After it finished saving the merged model, I ran my Python script with AutoModelForCausalLM again, but I still get the same error:

RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
	size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([32004, 8192]) from checkpoint, the shape in current model is torch.Size([32000, 8192]).
	size mismatch for lm_head.weight: copying a param with shape torch.Size([32004, 8192]) from checkpoint, the shape in current model is torch.Size([32000, 8192])

BRM10213 avatar Feb 16 '24 14:02 BRM10213

can you remove the old adapter files when you reload?
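
transformers loads a PEFT adapter automatically when it finds an adapter_config.json next to the weights — that is the model.load_adapter frame in your traceback. a quick way to spot leftover adapter files (a sketch, assuming the merged model directory from your script):

import os

model_dir = "/opt/huggingface/hub/CodeLlama-34b-Instruct-001"  # merged model directory from your script
leftover = [f for f in os.listdir(model_dir) if f.startswith("adapter_")]
print(leftover)  # e.g. ['adapter_config.json', 'adapter_model.safetensors'] -> move these out of the directory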

abhishekkrthakur avatar Feb 16 '24 14:02 abhishekkrthakur

I moved my adapter_config.json file to another location, and this is the new directory tree of the merged fine-tuned model:

tokenizer.model
training_args.bin
requirements.txt
handler.py
training_params.json
checkpoint-26
README.md
adapter_model.safetensors
generation_config.json
config.json
model-00001-of-00014.safetensors
model-00002-of-00014.safetensors
model-00003-of-00014.safetensors
model-00004-of-00014.safetensors
model-00005-of-00014.safetensors
model-00006-of-00014.safetensors
model-00007-of-00014.safetensors
model-00008-of-00014.safetensors
model-00009-of-00014.safetensors
model-00010-of-00014.safetensors
model-00011-of-00014.safetensors
model-00012-of-00014.safetensors
model-00013-of-00014.safetensors
model-00014-of-00014.safetensors
tokenizer_config.json
model.safetensors.index.json
special_tokens_map.json
added_tokens.json
tokenizer.json

When I run the Python script, AutoModelForCausalLM now seems to load correctly, but it fails with an error during inference:

 python3 test_model.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 18.21it/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
/root/anaconda3/envs/hf-env/lib/python3.11/site-packages/transformers/generation/utils.py:1128: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
Traceback (most recent call last):
  File "/home/user/test_model.py", line 28, in <module>
    output = model.generate(input_ids)
.... 
  File "/root/anaconda3/envs/hf-env/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'

BRM10213 avatar Feb 16 '24 14:02 BRM10213


Half-precision matmul is not implemented on CPU in PyTorch, so the float16 model needs to be placed on a GPU when loading:

model = AutoModelForCausalLM.from_pretrained(
            base_model_path,
            torch_dtype=torch.float16,
            low_cpu_mem_usage=True,
            trust_remote_code=True,
            device_map="auto",
        )

abhishekkrthakur avatar Feb 16 '24 14:02 abhishekkrthakur

Thank you @abhishekkrthakur, the test finally works!

This is my test code:

import torch

from transformers import AutoModelForCausalLM, AutoTokenizer


model_path = "/opt/huggingface/hub/CodeLlama-34b-Instruct-001"

tokenizer = AutoTokenizer.from_pretrained(model_path)

from_pretrained_kwargs = {
        'torch_dtype': torch.float16,
        'revision': 'main'
}
model = AutoModelForCausalLM.from_pretrained(
                model_path,
                low_cpu_mem_usage=True,
                trust_remote_code=True,
                device_map="cuda",
                **from_pretrained_kwargs,
)

input_text = "Health benefits of regular exercise"
input_ids = tokenizer.encode(input_text, return_tensors="pt").to("cuda")
output = model.generate(input_ids)
predicted_text = tokenizer.decode(output[0], skip_special_tokens=False)
print(predicted_text)

And this is my output when I run it:

python3 test_model.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:07<00:00,  1.95it/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
/root/anaconda3/envs/hf-env/lib/python3.11/site-packages/transformers/generation/utils.py:1128: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
  warnings.warn(
<s> Health benefits of regular exercise<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>

Finally, my remaining question is: why does the output contain these <s> and <unk> characters?

Is it because my dataset could contain strange values?

Maybe that is also the reason why my model's vocabulary size increased.
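
One way to confirm where the extra vocabulary entries come from, a small sketch assuming the base model and the autotrain output directory used above:

from transformers import AutoConfig, AutoTokenizer

base_config = AutoConfig.from_pretrained("codellama/CodeLlama-34b-Instruct-hf")
ft_tokenizer = AutoTokenizer.from_pretrained("/opt/huggingface/hub/CodeLlama-34b-Instruct-001")

print(base_config.vocab_size)                  # 32000 in the base checkpoint
print(len(ft_tokenizer))                       # 32004 once the extra special tokens are counted
print(ft_tokenizer.additional_special_tokens)  # the tokens added during training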

Thank you again.

BRM10213 avatar Feb 16 '24 14:02 BRM10213

glad to hear it worked. is there still an issue?

abhishekkrthakur avatar Feb 16 '24 15:02 abhishekkrthakur

No, this issue is already resolved.

I just want to figure out why my trained model adds these <s> and <unk> characters. I think my dataset may have the wrong encoding, or contain characters that are not visible, and that caused the model to change its dimensions.

Thank you for everything, @abhishekkrthakur

BRM10213 avatar Feb 16 '24 15:02 BRM10213

it's a bit difficult to answer that question without looking at the training data and the training parameters used. does it happen with all datasets?

abhishekkrthakur avatar Feb 16 '24 15:02 abhishekkrthakur

In previous versions of autotrain and transformers, I used a different dataset with autotrain without encountering any issues. I didn't even need to merge the fine-tuned model; I could simply load it directly with AutoModelForCausalLM.

Let me conduct some tests with the dataset I mentioned earlier, and I will share my results in this thread.

BRM10213 avatar Feb 16 '24 15:02 BRM10213

I trained with the current version of autotrain on the same dataset I used for my first training, and I get the same error:

RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
	size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([32004, 8192]) from checkpoint, the shape in current model is torch.Size([32000, 8192]).
	size mismatch for lm_head.weight: copying a param with shape torch.Size([32004, 8192]) from checkpoint, the shape in current model is torch.Size([32000, 8192]).

First training

This is the environment on the virtual machine where I originally started training:

peft==0.5.0
transformers==4.34.0
autotrain-advanced==0.6.37

And this is the command that I used for training:

autotrain llm \
--train \
--project_name '/opt/huggingface/hub/CodeLlama-34b-Instruct-Sybase' \
--model 'codellama/CodeLlama-34b-Instruct-hf' \
--data_path '/home/psadmin/Fine_Tuning' \
--use_peft \
--use-int8 \
--learning_rate 2e-4 \
--train_batch_size 10 \
--num_train_epochs 8 \
--trainer sft > training_sql.log &

That was all I needed to train and run my fine-tuned model.

Current training

This is the environment for my current training:

peft==0.8.2
transformers==4.37.0
autotrain-advanced==0.6.9

And this is the command that I used for training:

autotrain llm \
--train \
--text_column 'TEXT' \
--project-name 'CodeLlama-34b-001' \
--model 'codellama/CodeLlama-34b-Instruct-hf' \
--data-path '/home/azureuser/Fine_Tuning' \
--peft \
--quantization int8 \
--lr 2e-4 \
--batch-size 10 \
--epochs 8 \
--trainer sft

This is where I hit the issue when loading the fine-tuned model in transformers.

BRM10213 avatar Feb 16 '24 18:02 BRM10213

I compared the base model and the fine-tuned model, and I found this:

diff /opt/huggingface/hub/models--codellama--CodeLlama-34b-Instruct-hf/snapshots/cebb11eacbeecb9189e910d57a8faeadb949978f/special_tokens_map.json CodeLlama-34b-Instruct-Sybase/special_tokens_map.json
1a2,7
>   "additional_special_tokens": [
>     "▁<PRE>",
>     "▁<MID>",
>     "▁<SUF>",
>     "▁<EOT>"
>   ],
diff /opt/huggingface/hub/models--codellama--CodeLlama-34b-Instruct-hf/snapshots/cebb11eacbeecb9189e910d57a8faeadb949978f/tokenizer.json CodeLlama-34b-Instruct-Sybase/tokenizer.json
31a32,67
>     },
>     {
>       "id": 32000,
>       "content": "▁<PRE>",
>       "single_word": false,
>       "lstrip": false,
>       "rstrip": false,
>       "normalized": false,
>       "special": true
>     },
>     {
>       "id": 32001,
>       "content": "▁<MID>",
>       "single_word": false,
>       "lstrip": false,
>       "rstrip": false,
>       "normalized": false,
>       "special": true
>     },
>     {
>       "id": 32002,
>       "content": "▁<SUF>",
>       "single_word": false,
>       "lstrip": false,
>       "rstrip": false,
>       "normalized": false,
>       "special": true
>     },
>     {
>       "id": 32003,
>       "content": "▁<EOT>",
>       "single_word": false,
>       "lstrip": false,
>       "rstrip": false,
>       "normalized": false,
>       "special": true
diff /opt/huggingface/hub/models--codellama--CodeLlama-34b-Instruct-hf/snapshots/cebb11eacbeecb9189e910d57a8faeadb949978f/tokenizer_config.json CodeLlama-34b-Instruct-Sybase/tokenizer_config.json
4,10c4,60
<   "bos_token": {
<     "__type": "AddedToken",
<     "content": "<s>",
<     "lstrip": false,
<     "normalized": true,
<     "rstrip": false,
<     "single_word": false
---
>   "added_tokens_decoder": {
>     "0": {
>       "content": "<unk>",
>       "lstrip": false,
>       "normalized": false,
>       "rstrip": false,
>       "single_word": false,
>       "special": true
>     },
>     "1": {
>       "content": "<s>",
>       "lstrip": false,
>       "normalized": false,
>       "rstrip": false,
>       "single_word": false,
>       "special": true
>     },
>     "2": {
>       "content": "</s>",
>       "lstrip": false,
>       "normalized": false,
>       "rstrip": false,
>       "single_word": false,
>       "special": true
>     },
>     "32000": {
>       "content": "▁<PRE>",
>       "lstrip": false,
>       "normalized": false,
>       "rstrip": false,
>       "single_word": false,
>       "special": true
>     },
>     "32001": {
>       "content": "▁<MID>",
>       "lstrip": false,
>       "normalized": false,
>       "rstrip": false,
>       "single_word": false,
>       "special": true
>     },
>     "32002": {
>       "content": "▁<SUF>",
>       "lstrip": false,
>       "normalized": false,
>       "rstrip": false,
>       "single_word": false,
>       "special": true
>     },
>     "32003": {
>       "content": "▁<EOT>",
>       "lstrip": false,
>       "normalized": false,
>       "rstrip": false,
>       "single_word": false,
>       "special": true
>     }

Perhaps this is related to the first error I reported:

RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
	size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([32004, 8192]) from checkpoint, the shape in current model is torch.Size([32000, 8192]).
	size mismatch for lm_head.weight: copying a param with shape torch.Size([32004, 8192]) from checkpoint, the shape in current model is torch.Size([32000, 8192]).

I have the impression that these special tokens, which come from tokenization_code_llama.py, are the main reason for the exception.
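
A quick consistency check on the merged checkpoint (a sketch, using the merged model path from the script below):

import json

from transformers import AutoTokenizer

merged_path = "/opt/huggingface/hub/CodeLlama-34b-Sybase"

tokenizer = AutoTokenizer.from_pretrained(merged_path)
with open(f"{merged_path}/config.json") as f:
    vocab_size = json.load(f)["vocab_size"]

# after merge_adapter's resize_token_embeddings, both values should be 32004
print(len(tokenizer), vocab_size)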

I even created a dataset-validation Python script to check whether my dataset might contain any unknown tokens, but I didn't find any.

from transformers import PreTrainedTokenizerFast

# Assume that all your tokenizer files are in the same directory as `model_path`
model_path = "/opt/huggingface/hub/models--codellama--CodeLlama-34b-Instruct-hf/snapshots/cebb11eacbeecb9189e910d57a8faeadb949978f"

# Load the tokenizer
tokenizer = PreTrainedTokenizerFast(tokenizer_file=f"{model_path}/tokenizer.json",
                                    model_max_length=512,  # Adjust according to your configuration
                                    pad_token='[PAD]',
                                    eos_token='[EOS]',
                                    unk_token='[UNK]',
                                    )

print("Tokenizer successfully loaded.")

# Path to your training file
#training_file_path = "/home/hf/Fine_Tuning_T/fine-tuning-sybase-1.csv"
training_file_path = "/home/hf/Fine_Tuning/processed_data.csv"

# List to store unknown tokens
unknown_tokens = set()

with open(training_file_path, "r", encoding="utf-8") as file:
    for line in file:
        # Tokenize the line and truncate if necessary
        encoded_line = tokenizer.encode(line,
                                        add_special_tokens=False,
                                        truncation=True,  # Enable truncation
                                        max_length=512)  # Make sure not to exceed the maximum length
        # Look for unknown tokens
        for token_id in encoded_line:
            if tokenizer.convert_ids_to_tokens(token_id) == tokenizer.unk_token:
                unknown_tokens.add(token_id)

if unknown_tokens:
    print("Unknown tokens found:")
    for token_id in unknown_tokens:
        print(f"ID: {token_id}, Token: {tokenizer.convert_ids_to_tokens(token_id)}")
else:
    print("No unknown tokens found.")

Finally, if I use my merged model for inference, I only get <unk> tokens in the response.

import torch

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/opt/huggingface/hub/CodeLlama-34b-Sybase"

tokenizer = AutoTokenizer.from_pretrained(model_path)

from_pretrained_kwargs = {
        'torch_dtype': torch.float16,
        'revision': 'main'
}
model = AutoModelForCausalLM.from_pretrained(
                model_path,
                low_cpu_mem_usage=True,
                trust_remote_code=True,
                #vocab_size=32000,
                device_map="cuda",
                **from_pretrained_kwargs,
)

input_text = "Health benefits of regular exercise"
input_ids = tokenizer.encode(input_text, return_tensors="pt").to("cuda")
output = model.generate(input_ids, max_new_tokens=200)
predicted_text = tokenizer.decode(output[0], skip_special_tokens=False)
print(predicted_text)
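
As an aside, the two generation warnings in the logs (missing attention_mask and the default max_length) can be avoided by passing the full tokenizer output and an explicit max_new_tokens — a sketch reusing the model and tokenizer loaded above:

inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
output = model.generate(
    **inputs,                             # passes input_ids and attention_mask
    max_new_tokens=200,
    pad_token_id=tokenizer.eos_token_id,  # avoids the pad_token_id warning
)
print(tokenizer.decode(output[0], skip_special_tokens=True))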

BRM10213 avatar Feb 20 '24 05:02 BRM10213

This issue is stale because it has been open for 15 days with no activity.

github-actions[bot] avatar Mar 11 '24 15:03 github-actions[bot]

This issue was closed because it has been inactive for 2 days since being marked as stale.

github-actions[bot] avatar Mar 22 '24 15:03 github-actions[bot]