
Inquiry about Embedding Concatenation in DeepSpeed-VisualChat

Open teslacool opened this issue 1 year ago • 4 comments

@yaozhewei First, I'd like to extend my gratitude for the incredible work you've been doing with DeepSpeedExamples. It's truly commendable and has been a great resource for the community.

As I was exploring the code, particularly the section where language and visual embeddings are concatenated, I came across something that prompted a question. I noticed in this line of code:

https://github.com/microsoft/DeepSpeedExamples/blob/60e412eaa7275212e240f31055fc8b814ebe653f/applications/DeepSpeed-VisualChat/utils/model/modeling_dsvl.py#L226

that img_pos_list is reversed during the insertion of the image embedding. However, it appears that cur_img is not reversed in the process. Could there potentially be a mismatch between the visual and text information due to this? I'm curious if this is intentional for alignment purposes or if it might be an oversight.
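To make the concern concrete, here is a minimal sketch using plain Python lists instead of embedding tensors. The helper `splice`, the toy token strings, and the assumption that each image replaces a single placeholder token are all illustrative and are not taken from `modeling_dsvl.py`.

```python
def splice(tokens, images, positions):
    """Replace the placeholder at each position with that image's embeddings.

    Pairs are consumed in the order given; the caller decides the ordering.
    (Hypothetical helper for illustration only.)
    """
    out = list(tokens)
    for img, pos in zip(images, positions):
        out = out[:pos] + list(img) + out[pos + 1:]
    return out


tokens = ["a", "<img>", "b", "<img>", "c"]
cur_img = [["A1", "A2"], ["B1", "B2"]]  # image A belongs at index 1, image B at index 3
img_pos_list = [1, 3]

# Only the positions are reversed, which is the pattern this issue asks about.
mismatched = splice(tokens, cur_img, img_pos_list[::-1])
print(mismatched)  # ['a', 'B1', 'B2', 'b', 'A1', 'A2', 'c']
```

In this toy setup image A ends up where image B should be and vice versa, which is the kind of visual/text mismatch in question.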

I would appreciate any clarification you can provide on this matter.

teslacool avatar Nov 06 '23 03:11 teslacool

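I built a training set of 130,000 image-text pairs. vis_encoder = 'clip-vit-large-patch14'; lang_encoder is a language model I had fine-tuned previously (7B Chinesellama). Training parameters: lr = 1e-3, epoch = 6, warmup = 200.

First run: I used the original code with no modifications. The loss all but stopped decreasing by epochs 3-4, stalling at around 2.1. After 6 epochs the final loss was ~1.95 and eval_loss was 2.2, so the model had already overfit. In practice the best eval-loss was only about 2.1, and when I tested the model the results were very poor.

Second run (in progress): I changed the line above to the following: for img_i, img_pos in zip(cur_img, img_pos_list): so that the concatenation is done in order. Currently epoch = 2.3, loss ~1.98, eval-loss = 2.16. My impression is that convergence is somewhat faster this time, but it is hard to say what the final result will be.

I will report back with the validation results.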

X-jun-0130 avatar Nov 14 '23 01:11 X-jun-0130

Removing it directly doesn't seem quite right. I recommend keeping it and applying the flip operation to both 'cur_img' and 'img_pos_list'.

teslacool avatar Nov 14 '23 01:11 teslacool

Hi both, sorry for the late reply. You two are likely right: we should apply the flip operator to both lists. The reason we need to insert in reverse order is that, if the original order is used, the insertion position of the second (and any later) image shifts because of the insertion of the first image.
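The index-shift argument can be sketched with plain Python lists standing in for embedding tensors. The helper `replace_placeholders`, the toy data, and the assumption that each image expands one placeholder token into several entries are hypothetical and not copied from the repository.

```python
def replace_placeholders(tokens, images, positions, reverse):
    """Swap each placeholder for its image's embeddings.

    With reverse=True, images and positions are flipped together, so each
    image keeps its own position while later slots are filled first.
    (Hypothetical helper for illustration only.)
    """
    out = list(tokens)
    pairs = list(zip(images, positions))
    if reverse:
        pairs.reverse()
    for img, pos in pairs:
        out = out[:pos] + list(img) + out[pos + 1:]
    return out


tokens = ["a", "<img>", "b", "<img>", "c"]
images = [["A1", "A2"], ["B1", "B2"]]  # each image becomes two entries
positions = [1, 3]                     # indices of the placeholders in tokens

# Forward order: inserting image A lengthens the sequence, so index 3 is stale.
forward = replace_placeholders(tokens, images, positions, reverse=False)
print(forward)   # ['a', 'A1', 'A2', 'B1', 'B2', '<img>', 'c']

# Reverse order on both lists: earlier indices are untouched by later inserts.
backward = replace_placeholders(tokens, images, positions, reverse=True)
print(backward)  # ['a', 'A1', 'A2', 'b', 'B1', 'B2', 'c']
```

In the forward case the text token "b" is overwritten and a placeholder is left behind; flipping both lists together avoids this without having to recompute offsets.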

I have left the DeepSpeed team and no longer have any GPU access to validate this. @jeffra @tjruwase, could you find someone on the team to help verify this?

yaozhewei avatar Nov 14 '23 02:11 yaozhewei

> Removing it directly doesn't seem quite right. I recommend keeping it and applying the flip operation to both 'cur_img' and 'img_pos_list'.

You are right. I got a better model than last time.

X-jun-0130 avatar Nov 16 '23 03:11 X-jun-0130