FastChat

Leaving only 45 conversations in dummy.json results in an error

Open luckyfish0826 opened this issue 2 years ago • 5 comments

At first we edited the dummy.json file, changed "my name is Vicuna" to "my name is XXXXX", and kept all the other conversations (910 in total), then trained. The new model works fine when answering in English, but fails when we ask it in other languages.

So, to isolate the problem, we made the same change but kept only the 45 "who are you" conversations (deleting the other 865) and trained again. This time we got the error below:

RuntimeError: The size of tensor a (32768512) must match the size of tensor b (262148096) at non-singleton dimension 0

The full traceback is below. Can anyone help?

Not sure whether this really qualifies as an issue, but we could not find a better place to ask about it.


Traceback (most recent call last):
  File "/home/xxxxxx/source/FastChat/fastchat/train/train_mem.py", line 13, in <module>
    train()
  File "/home/xxxxxx/source/FastChat/fastchat/train/train.py", line 245, in train
    trainer.train()
  File "/home/xxxxxx/miniconda3/envs/fschat/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/home/xxxxxx/miniconda3/envs/fschat/lib/python3.10/site-packages/transformers/trainer.py", line 1996, in _inner_training_loop
    self.optimizer.step()
  File "/home/xxxxxx/miniconda3/envs/fschat/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/xxxxxx/miniconda3/envs/fschat/lib/python3.10/site-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/home/xxxxxx/miniconda3/envs/fschat/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/xxxxxx/miniconda3/envs/fschat/lib/python3.10/site-packages/torch/optim/adamw.py", line 162, in step
    adamw(params_with_grad,
  File "/home/xxxxxx/miniconda3/envs/fschat/lib/python3.10/site-packages/torch/optim/adamw.py", line 219, in adamw
    func(params,
  File "/home/xxxxxx/miniconda3/envs/fschat/lib/python3.10/site-packages/torch/optim/adamw.py", line 273, in _single_tensor_adamw
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
RuntimeError: The size of tensor a (32768512) must match the size of tensor b (262148096) at non-singleton dimension 0
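One thing worth noting about the numbers in the error (just an observation, not something confirmed anywhere in this thread): the second tensor is exactly eight times the size of the first, which would be consistent with a sharded optimizer state being stepped against an unsharded gradient, for example under 8-way FSDP/data-parallel training. A quick check:

```python
# Sizes copied from the RuntimeError above; the "factor of 8 = number of GPUs"
# reading is only a guess, not something the traceback proves.
size_a = 32_768_512    # exp_avg (AdamW first-moment state)
size_b = 262_148_096   # grad
print(size_b / size_a)  # 8.0
print(size_b % size_a)  # 0 -> exactly an 8x multiple
```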

luckyfish0826 · May 09 '23 12:05

Update: the same error appears in both 0.2.3 and 0.2.5.

luckyfish0826 · May 10 '23 03:05

Do you have gradient accumulation steps larger than your dataset size?
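For anyone unsure how the two relate, here is a rough back-of-the-envelope check for an HF Trainer setup (none of the values below come from this thread; they are purely hypothetical): the number of full optimizer updates per epoch is roughly dataset_size // (per_device_batch_size * num_gpus * gradient_accumulation_steps), so a tiny dataset combined with a large accumulation step can leave zero complete updates per epoch.

```python
# Illustrative only: every value here is hypothetical, not taken from the issue.
dataset_size = 45            # conversations kept in dummy.json
per_device_batch_size = 2    # hypothetical
num_gpus = 8                 # hypothetical
grad_accum_steps = 16        # hypothetical

batches_per_epoch = dataset_size // (per_device_batch_size * num_gpus)
full_updates_per_epoch = batches_per_epoch // grad_accum_steps
print(batches_per_epoch)       # 2
print(full_updates_per_epoch)  # 0 -> the accumulation window never fills
```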

gxy-gxy · Jun 07 '23 03:06

> Do you have gradient accumulation steps larger than your dataset size?

Not quite sure about this. In my case I changed nothing but the dummy.json file. It seems there is a minimum conversation count required; after some testing we found it is about 100. Really weird.

luckyfish0826 · Jun 09 '23 09:06

Oh, I ran into the same problem before. In my case it was because I used a small dataset and set a gradient accumulation step count larger than the dataset size. Training went back to normal after I lowered the accumulation steps. Maybe your situation is similar. Hope this helps! A sketch of the idea is below.
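A minimal sketch of that fix, assuming a standard transformers TrainingArguments-style setup (the helper and all numbers below are illustrative, not FastChat code): cap the accumulation steps so at least one full accumulation cycle fits into the data, or simply pass a smaller --gradient_accumulation_steps on the training command line when the dataset is tiny.

```python
# Sketch only; the helper and values are illustrative, not part of FastChat.
def safe_grad_accum(requested: int, dataset_size: int,
                    per_device_batch_size: int, num_gpus: int) -> int:
    """Clamp gradient accumulation so one full cycle fits the dataset."""
    batches_per_epoch = max(1, dataset_size // (per_device_batch_size * num_gpus))
    return max(1, min(requested, batches_per_epoch))

print(safe_grad_accum(16, 45, 2, 8))   # 2  -> shrink for a 45-sample dataset
print(safe_grad_accum(16, 910, 2, 8))  # 16 -> unchanged for the full 910 samples
```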

gxy-gxy · Jun 09 '23 11:06

Thank you, I am new to LLMs. I basically understand your point; I'll try it and see how it goes.

luckyfish0826 · Jun 12 '23 02:06