
RuntimeError: Error(s) in loading state_dict for EmbeddingPipe: size mismatch for word_embeddings.weight

Open mcao516 opened this issue 3 years ago • 9 comments

Describe the bug RuntimeError: Error(s) in loading state_dict for EmbeddingPipe: size mismatch for word_embeddings.weight: copying a param with shape torch.Size([25216, 6144]) from checkpoint, the shape in current model is torch.Size([50304, 6144]).

To Reproduce

  1. Download Slim weights
  2. Update the vocabulary and checkpoint path in ./configs/20B.yml (HFTokenizer is used)
  3. Run: ./deepy.py generate.py ./configs/20B.yml -i prompt.txt -o sample_outputs.txt
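For context, the mismatched shapes in the error (25216 vs. 50304 rows) are consistent with Megatron-style vocabulary padding, where the tokenizer's 50257 tokens are padded up to a multiple of `make_vocab_size_divisible_by * model_parallel_size`. A minimal sketch of that arithmetic (the function name and the default divisor of 128 are assumptions based on Megatron/GPT-NeoX conventions, not taken from this thread):

```python
def padded_vocab_size(vocab_size, divisible_by=128, model_parallel_size=1):
    """Pad the vocab so each model-parallel shard gets an equal,
    alignment-friendly slice of the embedding table (Megatron-style)."""
    multiple = divisible_by * model_parallel_size
    # Round up to the next multiple.
    return -(-vocab_size // multiple) * multiple

# The GPT-NeoX-20B tokenizer has 50257 tokens.
# Loading with model_parallel_size=1 expects the full padded table:
print(padded_vocab_size(50257, model_parallel_size=1))       # 50304
# A checkpoint saved with model_parallel_size=2 pads to 50432 and
# stores one half (25216 rows) per shard:
print(padded_vocab_size(50257, model_parallel_size=2) // 2)  # 25216
```

This reproduces both numbers in the error message, which suggests the checkpoint was saved with model parallelism 2 but is being loaded into a model configured without it.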

Screenshots: [screenshot of the error attached]

Environment (please complete the following information):

  • GPUs: 2x RTX8000 (48G)

mcao516 avatar Jul 07 '22 22:07 mcao516

I'm experiencing this too. Not sure what I'm doing wrong. I downloaded the weights from here, which is the "fixed" link from #646. However, I also downloaded the slim weights, and those seem to load OK, although the model's output is gibberish.

jdagdelen avatar Jul 16 '22 04:07 jdagdelen

I am getting the same problem too when trying to train a 1-3B model.

To Reproduce:

  1. Download Slim weights
  2. Update ./configs/1-3B.yml as shown in the screenshots below.
  3. Run python ./deepy.py train.py -d configs 1-3B.yml

Screenshots: [two screenshots of the 1-3B.yml config changes attached]

Environment:

  • GPU's: 4x 3090 (96G)

FayZ676 avatar Dec 09 '22 23:12 FayZ676

I also had the same problem: loading the slim weights downloaded from GitHub on a single machine reported a similar error. [Screenshot of the error message attached.]

Environment:

GPU's: 4x 3090 (96G)

binglun30 avatar Mar 28 '23 07:03 binglun30

What's the solution? And why was this closed?

djaym7 avatar Apr 19 '23 21:04 djaym7

@djaym7 Thanks for saying something. I don't recall closing this and have reopened it.

StellaAthena avatar Apr 19 '23 22:04 StellaAthena

@FayZ676 the URL you're linking to does not contain the weights for a 1.3B model; it contains the weights for a 20B model. That's why you're getting a size mismatch: it's quite simply the wrong size. I suspect that this is unrelated to the problems the others are having.

@leclem so that change allows you to finetune the 20B model? Can you post a WandB link showing it training so I can check out the loss etc are as expected?

StellaAthena avatar Apr 30 '23 15:04 StellaAthena

I have the same issue trying to train. I downloaded the slim weights, and using ./configs/20B.yml, running "python3 ./deepy.py train.py ./configs/20B.yml" gives this error:

RuntimeError: Error(s) in loading state_dict for EmbeddingPipe: size mismatch for word_embeddings.weight: copying a param with shape torch.Size([12608, 6144]) from checkpoint, the shape in current model is torch.Size([12672, 6144]).

shaunstoltz avatar Sep 25 '23 13:09 shaunstoltz

I suspect that this is an error that has to do with model parallelism. @shaunstoltz how many GPUs were you loading the model onto / what was the model parallelism setting?
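The model-parallelism theory fits the reported shapes exactly, under two assumptions not confirmed in this thread: that the slim 20B weights were saved with model parallelism 2, and that GPT-NeoX pads the vocab Megatron-style to a multiple of 128 * model_parallel_size. A rough arithmetic check:

```python
def padded_vocab_size(vocab_size, divisible_by=128, mp_size=1):
    """Megatron-style vocab padding: round up to a multiple of
    divisible_by * mp_size (assumed defaults, for illustration)."""
    multiple = divisible_by * mp_size
    return -(-vocab_size // multiple) * multiple

# Shapes from the error above:
checkpoint_rows = 12608
current_rows = 12672

# A fresh MP=4 model pads 50257 -> 50688, so each of 4 shards
# expects 12672 embedding rows:
assert padded_vocab_size(50257, mp_size=4) // 4 == current_rows

# The MP=2 padding (50432) split four ways gives 12608 rows, i.e. the
# checkpoint's shard sizes match an MP=2 save, not an MP=4 run:
assert padded_vocab_size(50257, mp_size=2) // 4 == checkpoint_rows
```

If this holds, re-partitioning the checkpoint to the target model-parallel degree (rather than loading the MP=2 shards directly) would be the fix.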

dashstander avatar Oct 03 '23 23:10 dashstander