Traceback (most recent call last):
File "/data/app/FastChat/fastChat/train/train.py", line 335, in
train()
File "/data/app/FastChat/fastChat/train/train.py", line 328, in train
trainer.train(resume_from_checkpoint=True)
File "/data/app/install/transformers/src/transformers/trainer.py", line 1651, in train
self._load_from_checkpoint(resume_from_checkpoint)
File "/data/app/install/transformers/src/transformers/trainer.py", line 2159, in _load_from_checkpoint
load_result = model.load_state_dict(state_dict, False)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([32001, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).
size mismatch for lm_head.weight: copying a param with shape torch.Size([32001, 4096]) from checkpoint, the shape in current model is torch.Size([32000, 4096]).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 6660) of binary: /usr/bin/python3
CUDA: 12.1
torch: 11.8
transformers: 4.28-dev
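For context, the shape mismatch (32001 vs. 32000) suggests the checkpoint was saved after an extra special token (typically a `[PAD]` token) was added to the tokenizer, so its `embed_tokens` and `lm_head` have one more row than the freshly loaded base LLaMA model. Below is a minimal sketch, not FastChat's exact code, assuming hypothetical paths and that the original run added a `[PAD]` token; it re-applies the token addition and resizes the embeddings before resuming so the shapes match:

```python
# Sketch of resuming with matching vocabulary size (paths are hypothetical).
import transformers

BASE_MODEL = "path/to/base-llama"   # base weights used for the original run
OUTPUT_DIR = "path/to/output_dir"   # directory containing checkpoint-*/

tokenizer = transformers.AutoTokenizer.from_pretrained(BASE_MODEL, use_fast=False)
model = transformers.AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# Re-apply the same special-token addition done before the original training run;
# resize_token_embeddings grows embed_tokens/lm_head from 32000 to 32001 rows,
# so the checkpoint's [32001, 4096] tensors load without a size mismatch.
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({"pad_token": "[PAD]"})
model.resize_token_embeddings(len(tokenizer))

# train_dataset, collator, and the rest of the training setup omitted for brevity.
training_args = transformers.TrainingArguments(output_dir=OUTPUT_DIR)
trainer = transformers.Trainer(model=model, args=training_args, tokenizer=tokenizer)
trainer.train(resume_from_checkpoint=True)
```

Alternatively, loading the tokenizer that was saved alongside the checkpoint (instead of the base one) should already report the larger vocabulary, as long as the embeddings are resized to `len(tokenizer)` before `trainer.train(resume_from_checkpoint=True)` is called.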
I had this error with an older version of the libraries and when there was not enough GPU memory. Can you try a clean new virtual environment with the latest versions of everything? Is it still an issue @landerson85 @Dankmank @elven2016?
I'm hitting the same issue. Have you solved it yet? Thanks a lot!