FastChat
When finetuning the 7B model, I get a runtime error: The size of tensor a (65537024) must match the size of tensor b (262148096)
RuntimeError: The size of tensor a (65537024) must match the size of tensor b (262148096) at non-singleton dimension 0
site-packages/torch/optim/adamw.py:273 in │
│ single_tensor_adamw │
│ │
│ 270 │ │ param.mul_(1 - lr * weight_decay) │
│ 271 │ │ │
│ 272 │ │ # Decay the first and second moment running average coefficient │
│ ❱ 273 │ │ exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1) │
│ 274 │ │ exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2) │
│ 275 │ │ │
│ 276 │ │ if capturable: │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: The size of tensor a (65537024) must match the size of tensor b (262148096) at non-singleton dimension 0
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 38467 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 38469 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 38468) of binary: /data/miniconda3/envs/arainmodel/bin/python
Traceback (most recent call last):
File "/data/miniconda3/envs/arainmodel/bin/torchrun", line 8, in
In my case, you have to make sure one epoch cannot be finished in a single step. Note that here one step is counted as one step with a backward pass (i.e. one optimizer step), so the following inequality should be satisfied:
$$\text{num\_gpus} \times \text{per\_device\_train\_batch\_size} \times \text{gradient\_accumulation\_steps} < \text{len(dataset)}$$
I don't exactly know why...
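The inequality above can be checked before launching training. This is a minimal sketch; the function is hypothetical, but its argument names mirror the HuggingFace Trainer flags:

```python
def epoch_exceeds_one_step(dataset_len, num_gpus,
                           per_device_train_batch_size,
                           gradient_accumulation_steps):
    """True when one epoch takes more than a single optimizer step,
    i.e. the effective batch is smaller than the dataset."""
    effective_batch = (num_gpus * per_device_train_batch_size
                       * gradient_accumulation_steps)
    return effective_batch < dataset_len

# 4 GPUs * batch size 8 * accumulation 16 = 512 samples per optimizer step
print(epoch_exceeds_one_step(1000, 4, 8, 16))  # True: 512 < 1000, training is safe
print(epoch_exceeds_one_step(500, 4, 8, 16))   # False: the epoch fits in one step
```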
> In my case, you have to make sure one epoch cannot be finished in a single step. Note that here one step is counted as one step with a backward pass, so the following inequality should be satisfied: num_gpus × per_device_train_batch_size × gradient_accumulation_steps < len(dataset). I don't exactly know why...
I met the same issue, but my dataset was huge and satisfied this relationship. The fact that 65537024 × 4 = 262148096 makes me suspect something went wrong when collecting the results from the 4 GPUs.
There was something wrong with the code I used to generate the dataset: instead of lots of conversations, I had a single conversation with a lot of sentences. Fixing that resolved the problem. Thank you @CaraJ7!
@Arain-sh - is it resolved now? Do you know the root cause?
I ran into the same issue. The solution provided by CaraJ7 is right: it seems the training process only runs successfully when the dataset is large enough for at least one optimizer step. I checked the code and the running state. When the dataset is too small, after running loss.backward() the first parameter of the model has shape (65537024,) and dtype float32, but its gradient has shape (262148096,) and dtype bfloat16. I think this causes the problem, but I'm not sure why. If the dataset is large enough, then after loss.backward() both of them have shape (65537024,) and dtype float32.

Since 262148096 = 4 × 65537024 and there are 4 GPUs running, I suspect there is a bug in transformers.Trainer under distributed training. As a workaround, just refer to CaraJ7's answer. I still hope someone can find the root cause.