
How can finetuning support batched training on a single GPU?

Open · Marcovaldon opened this issue on Feb 19, 2024 · 10 comments

As the title says: how can samples be batched on a single GPU during finetuning?

I set both the --batch_size and --per_device_train_batch_size arguments in finetune.sh to 2. After starting training, an error is raised at https://huggingface.co/internlm/internlm-xcomposer2-vl-7b/blob/main/modeling_internlm_xcomposer2.py#L266

The error is caused by the two samples having different token sequence lengths, so they cannot be concatenated.
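
For illustration, the mismatch can be reproduced in isolation with two hypothetical samples of different lengths:

    import torch

    a = torch.zeros(1, 120, 4096)  # sample 1: 120 tokens
    b = torch.zeros(1, 95, 4096)   # sample 2: 95 tokens
    torch.cat([a, b], dim=0)       # RuntimeError: sizes of tensors must match except in dimension 0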

Marcovaldon avatar Feb 19 '24 11:02 Marcovaldon

You can ensure an equal number of <ImageHere> tokens within the same JSON file, or manually add an extra padding step before concatenating.
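
For illustration, a minimal sketch of the manual-padding idea (the tensor names and sizes below are hypothetical, not the model's actual variables):

    import torch

    # Hypothetical per-sample embeddings of shape (1, seq_len, hidden_dim) with different seq_len.
    embeds = [torch.randn(1, 120, 4096), torch.randn(1, 95, 4096)]
    masks = [torch.ones(1, 120, dtype=torch.long), torch.ones(1, 95, dtype=torch.long)]

    max_len = max(e.shape[1] for e in embeds)
    padded_embeds, padded_masks = [], []
    for e, m in zip(embeds, masks):
        pad_len = max_len - e.shape[1]
        # Right-pad embeddings with zeros and mark the padded positions as 0 in the attention mask.
        padded_embeds.append(torch.cat([e, e.new_zeros(1, pad_len, e.shape[2])], dim=1))
        padded_masks.append(torch.cat([m, m.new_zeros(1, pad_len)], dim=1))

    batch_embeds = torch.cat(padded_embeds, dim=0)  # (batch, max_len, hidden_dim)
    batch_masks = torch.cat(padded_masks, dim=0)    # (batch, max_len)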

yuhangzang avatar Feb 19 '24 12:02 yuhangzang

I tried adding an extra padding step before concatenating, but then another error appears:

Traceback (most recent call last):
  File "finetune.py", line 312, in <module>
    train()
  File "finetune.py", line 302, in train
    trainer.train()
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1553, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1835, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2690, in training_step
    self.accelerator.backward(loss)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/accelerator.py", line 1960, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/engine.py", line 1890, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1953, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 871, in reduce_partition_and_remove_grads
    self.reduce_ready_partitions_and_remove_grads(param, i)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1332, in reduce_ready_partitions_and_remove_grads
    self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
  File "/usr/local/lib/python3.8/dist-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 906, in reduce_independent_p_g_buckets_and_remove_grads
    assert self.params_already_reduced[param_id] == False,
AssertionError: The parameter 935 has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported

How can I debug this?

Marcovaldon avatar Feb 20 '24 06:02 Marcovaldon

@myownskyW7 @yuhangzang I trained with only a single sample and batch size > 1, and the same error as above still occurs.

Marcovaldon avatar Feb 20 '24 07:02 Marcovaldon

Can you provide more details on how you add the padding?

yuhangzang avatar Feb 20 '24 11:02 yuhangzang


After https://huggingface.co/internlm/internlm-xcomposer2-vl-7b/blob/main/modeling_internlm_xcomposer2.py#L266, I implemented the padding as follows:

    # Right-pad every sample in the batch to the longest sequence length.
    longest_token_num = max([wrap_embeds_list[j].shape[1] for j in range(len(img_list))])
    for i in range(len(img_list)):
        padding_num = longest_token_num - wrap_embeds_list[i].shape[1]
        if padding_num == 0:
            continue
        # Pad the input embeddings with zeros.
        pad1 = torch.zeros((1, padding_num, 4096)).to(self.device)
        pad1 = pad1.type_as(wrap_embeds_list[i])
        wrap_embeds_list[i] = torch.cat([wrap_embeds_list[i], pad1], dim=1)
        # Pad the attention mask with zeros so the padded positions are masked out.
        pad2 = torch.zeros((1, padding_num)).to(self.device)
        pad2 = pad2.type_as(wrap_atts_list[i])
        wrap_atts_list[i] = torch.cat([wrap_atts_list[i], pad2], dim=1)
        # Pad the targets with the pad token id.
        pad3 = torch.ones((1, padding_num)).to(self.device) * self.tokenizer.pad_token_id
        pad3 = pad3.type_as(wrap_target_list[i])
        wrap_target_list[i] = torch.cat([wrap_target_list[i], pad3], dim=1)
        # Pad the image mask with zeros (padded positions are not image tokens).
        pad4 = torch.zeros((1, padding_num)).to(self.device)
        pad4 = pad4.type_as(wrap_im_mask_list[i])
        wrap_im_mask_list[i] = torch.cat([wrap_im_mask_list[i], pad4], dim=1)

By the way, when I use the original code with only one sample (repeated 100k times) and train with batch size > 1, the same error appears:

AssertionError: The parameter 935 has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported

Marcovaldon avatar Feb 20 '24 11:02 Marcovaldon

You may avoid the in-place variable replacement and define new variables instead.

yuhangzang avatar Feb 21 '24 12:02 yuhangzang


But when I remove my own changes and train with the original code, using a single sample repeated 100k times as the dataset, the same error still occurs when batch > 1.

Marcovaldon avatar Feb 21 '24 12:02 Marcovaldon

I mean you may avoid the in-place replacement such as wrap_embeds_list[i] = torch.cat([wrap_embeds_list[i], pad1], dim=1), and define new variables, e.g., wrap_embeds_list_new[i] = torch.cat([wrap_embeds_list[i], pad1], dim=1)
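
For example, a rough sketch of the non-in-place variant, reusing the names from the snippet above (wrap_embeds_list_new is illustrative, and the same pattern would apply to the attention, target, and image-mask lists):

    # Collect the padded tensors into a new list instead of overwriting the originals,
    # so the tensors produced by the forward pass are left untouched.
    wrap_embeds_list_new = []
    for i in range(len(wrap_embeds_list)):
        padding_num = longest_token_num - wrap_embeds_list[i].shape[1]
        if padding_num == 0:
            wrap_embeds_list_new.append(wrap_embeds_list[i])
            continue
        pad1 = torch.zeros(1, padding_num, 4096,
                           device=wrap_embeds_list[i].device,
                           dtype=wrap_embeds_list[i].dtype)
        wrap_embeds_list_new.append(torch.cat([wrap_embeds_list[i], pad1], dim=1))
    # ...do the same for wrap_atts_list, wrap_target_list, and wrap_im_mask_list...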

yuhangzang avatar Feb 22 '24 10:02 yuhangzang


I tried that, but the error message is the same. May I ask whether the authors batched samples (batch size > 1) during finetuning?

I also experimented with the completely original code: the training data is the same case repeated 100k times. In that setting, batching at https://huggingface.co/internlm/internlm-xcomposer2-vl-7b/blob/main/modeling_internlm_xcomposer2.py#L266 goes through (sequences built from the same case naturally have the same length), but backpropagation still fails with the following error:

AssertionError: The parameter 935 has already been reduced. Gradient computed twice for this partition. Multiple gradient reduction is currently not supported

Marcovaldon avatar Feb 26 '24 03:02 Marcovaldon


Hi, may I ask how the padding values should be set? Is there any reference for this?

decreasbetter avatar May 21 '24 09:05 decreasbetter

Batch size > 1 is already supported in the fine-tuning code for IXC 2.5. You can refer to our implementation.

yuhangzang avatar Jul 17 '24 09:07 yuhangzang

Hi @yuhangzang, thanks for this great work! In my opinion, the batch_size here behaves more like sequence packing. I think renaming it would avoid confusion.

Coobiw avatar Feb 08 '25 11:02 Coobiw