
When finetuning the 7B model, I get a runtime error: The size of tensor a (65537024) must match the size of tensor b (262148096)


RuntimeError: The size of tensor a (65537024) must match the size of tensor b (262148096) at non-singleton dimension 0

```
site-packages/torch/optim/adamw.py:273 in _single_tensor_adamw

   270 │   param.mul_(1 - lr * weight_decay)
   271 │
   272 │   # Decay the first and second moment running average coefficient
 ❱ 273 │   exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
   274 │   exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
   275 │
   276 │   if capturable:

RuntimeError: The size of tensor a (65537024) must match the size of tensor b (262148096) at non-singleton dimension 0
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 38467 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 38469 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 38468) of binary: /data/miniconda3/envs/arainmodel/bin/python
Traceback (most recent call last):
  File "/data/miniconda3/envs/arainmodel/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/data/miniconda3/envs/arainmodel/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/data/miniconda3/envs/arainmodel/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/data/miniconda3/envs/arainmodel/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/data/miniconda3/envs/arainmodel/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/miniconda3/envs/arainmodel/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```

Arain-sh opened this issue Apr 21 '23 08:04

In my case, you have to make sure one epoch cannot be finished in a single step. Note that one step here is counted as one step with a backward pass, according to [1]; namely, the following inequality should be satisfied:

$$\text{num\_gpus} \times \text{per\_device\_train\_batch\_size} \times \text{gradient\_accumulation\_steps} < \mathrm{len}(\text{dataset})$$

I don't know exactly why, though...
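For concreteness, a quick check of this condition could look like the sketch below. The variable names just mirror the usual HF Trainer flags, and the values are placeholders for whatever you pass on the command line:

```python
# Sketch: verify the dataset yields at least one full optimizer step.
# The names mirror the HF Trainer flags; the values below are examples only.
num_gpus = 4
per_device_train_batch_size = 2
gradient_accumulation_steps = 16

# Samples consumed by a single optimizer step across all ranks.
global_batch_size = num_gpus * per_device_train_batch_size * gradient_accumulation_steps

dataset_len = 100_000  # replace with len(train_dataset) in your setup

assert global_batch_size < dataset_len, (
    f"one optimizer step consumes {global_batch_size} samples, "
    f"but the dataset only has {dataset_len}"
)
```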

CaraJ7 commented Apr 22 '23 04:04

> In my case, you have to make sure one epoch cannot be finished in a single step. Note that one step here is counted as one step with a backward pass, according to [1]; namely, the following inequality should be satisfied:
>
> $$\text{num\_gpus} \times \text{per\_device\_train\_batch\_size} \times \text{gradient\_accumulation\_steps} < \mathrm{len}(\text{dataset})$$
>
> I don't know exactly why, though...

I met the same issue, but I had a huge dataset that satisfied this relationship. The fact that 65537024 × 4 = 262148096 makes me suspect that something went wrong when gathering the results from the 4 GPUs.

It turned out there was something wrong with the code I used to generate the dataset: instead of lots of conversations, I had a single conversation with a lot of sentences. Fixing that resolved the problem. Thank you @CaraJ7!
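For anyone hitting the same thing, a quick sanity check on the data file would have caught my mistake. This is only a sketch: it assumes the FastChat-style layout (a JSON list of records, each carrying its own "conversations" list), and `data.json` is a placeholder path:

```python
import json

# Rough sanity check on the training data, assuming a FastChat-style layout:
# a JSON list of records, each with its own "conversations" list of turns.
with open("data.json") as f:
    data = json.load(f)

num_records = len(data)
turns_per_record = [len(rec.get("conversations", [])) for rec in data]

print(f"{num_records} conversations, "
      f"max turns in one conversation: {max(turns_per_record)}")

# A single record containing thousands of turns is a red flag: it means the
# generator collapsed everything into one conversation instead of many.
```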

Metric-Void commented Apr 26 '23 05:04

@Arain-sh - is it resolved now? Do you know the root cause?

Sparetavns commented May 04 '23 09:05

I met the same issue. The solution provided by CaraJ7 is right: it seems training only runs successfully when the dataset is large enough to complete at least one optimizer step.

I checked the code and the runtime state. When the dataset is too small, after running loss.backward() the first parameter of the model has shape (65537024,) and dtype float32, but its gradient has shape (262148096,) and dtype bfloat16. I think this mismatch causes the problem, but I'm not sure why. When the dataset is large enough, both of them have shape (65537024,) and dtype float32 after loss.backward().

Since 262148096 = 4 × 65537024 and there are 4 GPUs running, I suspect there is a bug when using transformers.Trainer in distributed training. As a workaround, just follow CaraJ7's answer. I still hope someone can find the root cause.
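Here is a sketch of the kind of inspection I mean, not my exact code: `dump_param_grad_shapes` is just a hypothetical helper, and `model` stands for whatever nn.Module the Trainer is optimizing. Call it right after loss.backward():

```python
import torch

def dump_param_grad_shapes(model: torch.nn.Module, limit: int = 5) -> None:
    """Print shape/dtype of the first few parameters and their gradients."""
    for i, (name, p) in enumerate(model.named_parameters()):
        if i >= limit:
            break
        grad_info = (
            (tuple(p.grad.shape), p.grad.dtype) if p.grad is not None else None
        )
        print(f"{name}: param {tuple(p.shape)} {p.dtype} | grad {grad_info}")

# In the failing case the first parameter printed something like
#   param (65537024,) torch.float32 | grad ((262148096,), torch.bfloat16)
# while in the working case param and grad shapes/dtypes matched.
```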

Yuxin715d commented Jul 26 '23 15:07