autotrain-advanced
[BUG] Loading auto train model gets size mismatch on AutoModelForCausalLM
Prerequisites
- [X] I have read the documentation.
- [X] I have checked other issues for similar problems.
Backend
Local
Interface Used
CLI
CLI Command
autotrain llm \
--train \
--project-name "$PROJECT_NAME" \
--model "$MODEL_NAME" \
--data-path "$DATA_PATH" \
--text_column "$TEXT_COLUMN" \
--use-peft \
--quantization "$QUANTIZATION" \
--lr "$LEARNING_RATE" \
--train-batch-size "$BATCH_SIZE" \
--epochs "$EPOCHS" \
--trainer "$TRAINER" \
--model_max_length "$MAX_LENGTH" \
--block_size "$BLOCK_SIZE" \
> training.log 2>&1 &
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_path = "/opt/huggingface/hub/CodeLlama-34b-Instruct-001"
tokenizer = AutoTokenizer.from_pretrained(model_path)
from_pretrained_kwargs = {
    'torch_dtype': torch.float32,
    'revision': 'main'
}
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    **from_pretrained_kwargs,
)
input_text = "Health benefits of regular exercise"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
output = model.generate(input_ids)
predicted_text = tokenizer.decode(output[0], skip_special_tokens=False)
print(predicted_text)
UI Screenshots & Parameters
No response
Error Logs
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:06<00:00, 1.09it/s]
Traceback (most recent call last):
File "/home/azureuser/test_model.py", line 17, in
Additional Information
After training the model using autotrain, I try to load the model with AutoModelForCausalLM, but it throws an error because the dimensions of the generated model do not match the base model.
seems like a transformers issue. could you merge the model and then try the same code? here is the code for merging:
import logging

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

logger = logging.getLogger(__name__)  # any logger works here


def merge_adapter(base_model_path, target_model_path, adapter_path):
    logger.info("Loading adapter...")
    # Load the base model in fp16, pick up the fine-tuned tokenizer saved in the
    # target directory, and resize the embeddings to match its (possibly extended) vocabulary
    model = AutoModelForCausalLM.from_pretrained(
        base_model_path,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(
        target_model_path,
        trust_remote_code=True,
    )
    model.resize_token_embeddings(len(tokenizer))

    # Apply the PEFT adapter and merge it into the base weights
    model = PeftModel.from_pretrained(model, adapter_path)
    model = model.merge_and_unload()

    logger.info("Saving target model...")
    model.save_pretrained(target_model_path)
    tokenizer.save_pretrained(target_model_path)
requirements: peft==0.8.2 transformers==4.37.0
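For reference, a call to this helper might look like the following sketch; the paths are hypothetical, with the merged model written back into the autotrain project directory (which already holds the tokenizer and adapter files):

merge_adapter(
    base_model_path="codellama/CodeLlama-34b-Instruct-hf",                # base model on the Hub
    target_model_path="/opt/huggingface/hub/CodeLlama-34b-Instruct-001",  # where the merged model is saved
    adapter_path="/opt/huggingface/hub/CodeLlama-34b-Instruct-001",       # dir containing adapter_model.safetensors
)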
Thank you for your response @abhishekkrthakur
I used your merge_adapter on my fine-tuned model and it created these new files:
model-00001-of-00014.safetensors
model-00002-of-00014.safetensors
....
model-00014-of-00014.safetensors
tokenizer_config.json
model.safetensors.index.json
special_tokens_map.json
added_tokens.json
tokenizer.json
After it finished saving the merged model, I tried running my Python script with AutoModelForCausalLM again, but I still get the same error:
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([32004, 8192]) from checkpoint, the shape in current model is torch.Size([32000, 8192]).
size mismatch for lm_head.weight: copying a param with shape torch.Size([32004, 8192]) from checkpoint, the shape in current model is torch.Size([32000, 8192])
can you remove the old adapter files when you reload?
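If I understand the PEFT integration correctly, recent transformers releases detect a leftover adapter_config.json in a model directory and load it as an adapter on top of the base model, which can shadow the merged weights. A minimal sketch for moving the old adapter files aside (paths are hypothetical):

import os
import shutil

# Hypothetical paths: the merged model directory and a backup location for the old adapter files
merged_dir = "/opt/huggingface/hub/CodeLlama-34b-Instruct-001"
backup_dir = "/opt/huggingface/hub/adapter-backup"
os.makedirs(backup_dir, exist_ok=True)

for name in ("adapter_config.json", "adapter_model.safetensors"):
    src = os.path.join(merged_dir, name)
    if os.path.exists(src):
        shutil.move(src, os.path.join(backup_dir, name))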
I moved my adapter_config.json file to another location, and this is the new directory tree of the merged fine-tuned model:
tokenizer.model
training_args.bin
requirements.txt
handler.py
training_params.json
checkpoint-26
README.md
adapter_model.safetensors
generation_config.json
config.json
model-00001-of-00014.safetensors
model-00002-of-00014.safetensors
model-00003-of-00014.safetensors
model-00004-of-00014.safetensors
model-00005-of-00014.safetensors
model-00006-of-00014.safetensors
model-00007-of-00014.safetensors
model-00008-of-00014.safetensors
model-00009-of-00014.safetensors
model-00010-of-00014.safetensors
model-00011-of-00014.safetensors
model-00012-of-00014.safetensors
model-00013-of-00014.safetensors
model-00014-of-00014.safetensors
tokenizer_config.json
model.safetensors.index.json
special_tokens_map.json
added_tokens.json
tokenizer.json
When I run the Python script, AutoModelForCausalLM now seems to load correctly, but it throws an error during generation:
python3 test_model.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:00<00:00, 18.21it/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
/root/anaconda3/envs/hf-env/lib/python3.11/site-packages/transformers/generation/utils.py:1128: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
warnings.warn(
Traceback (most recent call last):
File "/home/user/test_model.py", line 28, in <module>
output = model.generate(input_ids)
....
File "/root/anaconda3/envs/hf-env/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: "addmm_impl_cpu_" not implemented for 'Half'
model = AutoModelForCausalLM.from_pretrained(
    base_model_path,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="auto",
)
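The "addmm_impl_cpu_" error comes from PyTorch having no fp16 matmul kernel on CPU, so a half-precision model needs to run on a GPU (as with device_map above) or be loaded in float32 for CPU-only inference. A minimal CPU fallback sketch, reusing model_path and input_ids from the earlier test script:

import torch
from transformers import AutoModelForCausalLM

# CPU-only fallback: load in float32 instead of float16 (slower and uses more RAM)
model = AutoModelForCausalLM.from_pretrained(
    model_path,                 # same model_path as in the test script above
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)
output = model.generate(input_ids, max_new_tokens=20)  # input_ids from the same script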
Thank you @abhishekkrthakur, the test finally works!
This is my test code:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_path = "/opt/huggingface/hub/CodeLlama-34b-Instruct-001"
tokenizer = AutoTokenizer.from_pretrained(model_path)
from_pretrained_kwargs = {
    'torch_dtype': torch.float16,
    'revision': 'main'
}
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="cuda",
    **from_pretrained_kwargs,
)
input_text = "Health benefits of regular exercise"
input_ids = tokenizer.encode(input_text, return_tensors="pt").to("cuda")
output = model.generate(input_ids)
predicted_text = tokenizer.decode(output[0], skip_special_tokens=False)
print(predicted_text)
And this is my output when I run
python3 test_model.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 14/14 [00:07<00:00, 1.95it/s]
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
/root/anaconda3/envs/hf-env/lib/python3.11/site-packages/transformers/generation/utils.py:1128: UserWarning: Using the model-agnostic default `max_length` (=20) to control the generation length. We recommend setting `max_new_tokens` to control the maximum length of the generation.
warnings.warn(
<s> Health benefits of regular exercise<unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk><unk>
Finally, my doubt is: why did it generate these characters: <s> and <unk>?
Is it because my dataset could contain strange values?
Maybe this is the main reason why my model increased its vocabulary.
Thank you again.
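As a side note on the warnings in the log above, passing an explicit attention mask, max_new_tokens, and pad token id makes generation more predictable; a minimal sketch built on the test script above:

inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
output = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],  # silences the attention-mask warning
    max_new_tokens=200,                       # replaces the default max_length=20
    pad_token_id=tokenizer.eos_token_id,      # matches the eos_token_id fallback in the log
)
predicted_text = tokenizer.decode(output[0], skip_special_tokens=False)
print(predicted_text)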
glad to hear it worked. is there still an issue?
No, this issue is already resolved.
I just want to figure out why my trained model adds these <s> and <unk> characters. I think my dataset may have the wrong encoding, or it may contain invisible characters, and that causes my fine-tuned model to change its dimensions.
Thank you for everything, @abhishekkrthakur
it's a bit difficult to answer that question without looking at the training data and the training parameters used. does it happen with all datasets?
In previous versions of autotrain and transformers, I utilized a different dataset with autotrain without encountering any issues. I didn't even need to merge the fine-tuning model; I could simply run it directly in AutoModelForCausalLM.
Let me conduct some tests with the dataset I mentioned earlier, and I will share my results in this thread.
I loaded the first dataset that I used in my first training with the current version of autotrain, and I get the same error:
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([32004, 8192]) from checkpoint, the shape in current model is torch.Size([32000, 8192]).
size mismatch for lm_head.weight: copying a param with shape torch.Size([32004, 8192]) from checkpoint, the shape in current model is torch.Size([32000, 8192]).
First training
This is the environment on the virtual machine where I started doing some training runs:
peft==0.5.0
transformers==4.34.0
autotrain-advanced==0.6.37
And this is the command that I used for training:
autotrain llm \
--train \
--project_name '/opt/huggingface/hub/CodeLlama-34b-Instruct-Sybase' \
--model 'codellama/CodeLlama-34b-Instruct-hf' \
--data_path '/home/psadmin/Fine_Tuning' \
--use_peft \
--use-int8 \
--learning_rate 2e-4 \
--train_batch_size 10 \
--num_train_epochs 8 \
--trainer sft > training_sql.log &
That's all I needed to run my fine-tuned model.
Current training
This is the environment for my current training:
peft==0.8.2
transformers==4.37.0
autotrain-advanced==0.6.9
And this is the command that I used for training:
autotrain llm \
--train \
--text_column 'TEXT' \
--project-name 'CodeLlama-34b-001' \
--model 'codellama/CodeLlama-34b-Instruct-hf' \
--data-path '/home/azureuser/Fine_Tuning' \
--peft \
--quantization int8 \
--lr 2e-4 \
--batch-size 10 \
--epochs 8 \
--trainer sft
This is where I have the issue when I load the fine-tuned model in transformers.
I compared the base model and the fine-tuned model and found this:
diff /opt/huggingface/hub/models--codellama--CodeLlama-34b-Instruct-hf/snapshots/cebb11eacbeecb9189e910d57a8faeadb949978f/special_tokens_map.json CodeLlama-34b-Instruct-Sybase/special_tokens_map.json
1a2,7
> "additional_special_tokens": [
> "▁<PRE>",
> "▁<MID>",
> "▁<SUF>",
> "▁<EOT>"
> ],
diff /opt/huggingface/hub/models--codellama--CodeLlama-34b-Instruct-hf/snapshots/cebb11eacbeecb9189e910d57a8faeadb949978f/tokenizer.json CodeLlama-34b-Instruct-Sybase/tokenizer.json
31a32,67
> },
> {
> "id": 32000,
> "content": "▁<PRE>",
> "single_word": false,
> "lstrip": false,
> "rstrip": false,
> "normalized": false,
> "special": true
> },
> {
> "id": 32001,
> "content": "▁<MID>",
> "single_word": false,
> "lstrip": false,
> "rstrip": false,
> "normalized": false,
> "special": true
> },
> {
> "id": 32002,
> "content": "▁<SUF>",
> "single_word": false,
> "lstrip": false,
> "rstrip": false,
> "normalized": false,
> "special": true
> },
> {
> "id": 32003,
> "content": "▁<EOT>",
> "single_word": false,
> "lstrip": false,
> "rstrip": false,
> "normalized": false,
> "special": true
diff /opt/huggingface/hub/models--codellama--CodeLlama-34b-Instruct-hf/snapshots/cebb11eacbeecb9189e910d57a8faeadb949978f/tokenizer_config.json CodeLlama-34b-Instruct-Sybase/tokenizer_config.json
4,10c4,60
< "bos_token": {
< "__type": "AddedToken",
< "content": "<s>",
< "lstrip": false,
< "normalized": true,
< "rstrip": false,
< "single_word": false
---
> "added_tokens_decoder": {
> "0": {
> "content": "<unk>",
> "lstrip": false,
> "normalized": false,
> "rstrip": false,
> "single_word": false,
> "special": true
> },
> "1": {
> "content": "<s>",
> "lstrip": false,
> "normalized": false,
> "rstrip": false,
> "single_word": false,
> "special": true
> },
> "2": {
> "content": "</s>",
> "lstrip": false,
> "normalized": false,
> "rstrip": false,
> "single_word": false,
> "special": true
> },
> "32000": {
> "content": "▁<PRE>",
> "lstrip": false,
> "normalized": false,
> "rstrip": false,
> "single_word": false,
> "special": true
> },
> "32001": {
> "content": "▁<MID>",
> "lstrip": false,
> "normalized": false,
> "rstrip": false,
> "single_word": false,
> "special": true
> },
> "32002": {
> "content": "▁<SUF>",
> "lstrip": false,
> "normalized": false,
> "rstrip": false,
> "single_word": false,
> "special": true
> },
> "32003": {
> "content": "▁<EOT>",
> "lstrip": false,
> "normalized": false,
> "rstrip": false,
> "single_word": false,
> "special": true
> }
Perhaps this is related to the first error I reported:
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([32004, 8192]) from checkpoint, the shape in current model is torch.Size([32000, 8192]).
size mismatch for lm_head.weight: copying a param with shape torch.Size([32004, 8192]) from checkpoint, the shape in current model is torch.Size([32000, 8192]).
I have the impression that these special tokens (added by tokenization_code_llama.py) are the main reason for the exception.
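A quick way to check this hypothesis is to compare the base and fine-tuned tokenizer sizes; a minimal sketch, assuming the merged model path used later in this thread:

from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("codellama/CodeLlama-34b-Instruct-hf")
fine_tuned = AutoTokenizer.from_pretrained("/opt/huggingface/hub/CodeLlama-34b-Sybase")

# If the fine-tuned tokenizer reports 32004 vs 32000, the four FIM tokens
# (▁<PRE>, ▁<MID>, ▁<SUF>, ▁<EOT>) explain the embed_tokens/lm_head shape mismatch.
print(len(base), len(fine_tuned))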
I even created a dataset validation Python script to check whether my dataset contains any unknown tokens, but I didn't find any.
from transformers import PreTrainedTokenizerFast

# Assume that all your tokenizer files are in the same directory as `model_path`
model_path = "/opt/huggingface/hub/models--codellama--CodeLlama-34b-Instruct-hf/snapshots/cebb11eacbeecb9189e910d57a8faeadb949978f"

# Load the tokenizer
tokenizer = PreTrainedTokenizerFast(
    tokenizer_file=f"{model_path}/tokenizer.json",
    model_max_length=512,  # Adjust according to your configuration
    pad_token='[PAD]',
    eos_token='[EOS]',
    unk_token='[UNK]',
)
print("Tokenizer successfully loaded.")

# Path to your training file
#training_file_path = "/home/hf/Fine_Tuning_T/fine-tuning-sybase-1.csv"
training_file_path = "/home/hf/Fine_Tuning/processed_data.csv"

# List to store unknown tokens
unknown_tokens = set()

with open(training_file_path, "r", encoding="utf-8") as file:
    for line in file:
        # Tokenize the line and truncate if necessary
        encoded_line = tokenizer.encode(
            line,
            add_special_tokens=False,
            truncation=True,  # Enable truncation
            max_length=512,   # Make sure not to exceed the maximum length
        )
        # Look for unknown tokens
        for token_id in encoded_line:
            if tokenizer.convert_ids_to_tokens(token_id) == tokenizer.unk_token:
                unknown_tokens.add(token_id)

if unknown_tokens:
    print("Unknown tokens found:")
    for token_id in unknown_tokens:
        print(f"ID: {token_id}, Token: {tokenizer.convert_ids_to_tokens(token_id)}")
else:
    print("No unknown tokens found.")
Finally, if I use my merged model for inference, I only get unknown tokens in the response.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_path = "/opt/huggingface/hub/CodeLlama-34b-Sybase"
tokenizer = AutoTokenizer.from_pretrained(model_path)
from_pretrained_kwargs = {
    'torch_dtype': torch.float16,
    'revision': 'main'
}
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    #vocab_size=32000,
    device_map="cuda",
    **from_pretrained_kwargs,
)
input_text = "Health benefits of regular exercise"
input_ids = tokenizer.encode(input_text, return_tensors="pt").to("cuda")
output = model.generate(input_ids, max_new_tokens=200)
predicted_text = tokenizer.decode(output[0], skip_special_tokens=False)
print(predicted_text)
This issue is stale because it has been open for 15 days with no activity.
This issue was closed because it has been inactive for 2 days since being marked as stale.