Shengqiong Wu

Results: 31 comments of Shengqiong Wu

@pengxuan001, actually, the weights produced by the previous training stage are loaded and used during the next stage of training: https://github.com/NExT-GPT/NExT-GPT/blob/e2e2f9477a110403f6b4e719ebb868c9f44f7a1b/code/model/agent.py#L17 If you want to save the weights trained in different stages separately,...

This is because we resize the embedding layer of the LLM after adding new tokens, so the parameters of the embedding layer become trainable.
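For reference, a minimal sketch of that step with Hugging Face transformers (the model name and the signal-token strings below are placeholders, not necessarily the ones used in this repo):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "lmsys/vicuna-7b-v1.5"                         # placeholder base LLM
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Add the new signal tokens (names are illustrative).
tokenizer.add_tokens(["[IMG0]", "[IMG1]", "[VID0]", "[AUD0]"], special_tokens=True)

# Resizing allocates a larger embedding matrix; its parameters are trainable by default.
model.resize_token_embeddings(len(tokenizer))

# To update only the embeddings, everything else has to be frozen explicitly.
for p in model.parameters():
    p.requires_grad = False
model.get_input_embeddings().weight.requires_grad = True
model.get_output_embeddings().weight.requires_grad = True
```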

Sorry, I do not understand your question. If you are referring to the representations of the signal tokens, they are extracted from the final hidden layer of the LLM.
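If it helps, here is a rough, self-contained sketch of how such representations can be pulled out with transformers/PyTorch (model name, token names, and prompt are illustrative, not the repo's actual setup):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "lmsys/vicuna-7b-v1.5"                         # placeholder base LLM
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
tokenizer.add_tokens(["[IMG0]", "[IMG1]"], special_tokens=True)   # illustrative signal tokens
model.resize_token_embeddings(len(tokenizer))

enc = tokenizer("a photo of a dog [IMG0] [IMG1]", return_tensors="pt")
with torch.no_grad():
    out = model(**enc, output_hidden_states=True, return_dict=True)
last_hidden = out.hidden_states[-1]                   # final hidden layer: (1, seq_len, hidden)

# Keep only the hidden states at the signal-token positions.
signal_ids = tokenizer.convert_tokens_to_ids(["[IMG0]", "[IMG1]"])
mask = torch.isin(enc["input_ids"], torch.tensor(signal_ids))
signal_reprs = last_hidden[mask]                      # (num_signal_tokens, hidden)
```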

@pengxuan001, we have released the new version of the code, including all training and tuning procedures. Please refer to it.

Thank you for your interest and for bringing up this issue. This is indeed a bug, but fortunately it has only a minor influence. We actually updated the code three weeks...

I think this would be a challenge because the model structures of CogView2, CogVideo, and the diffusion-based models are different. Diffusion-based models employ a separate text encoder, which extracts textual...

In the first stage, we focus only on the encoding-side alignment. Specifically, only the text-'X' pair data are utilized. As for the second stage, the embeddings generated by the text...
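To make the first stage concrete, here is a minimal conceptual sketch of encoding-side alignment; the dimensions and module names are illustrative, not the repository's actual implementation:

```python
import torch
import torch.nn as nn

class InputProjector(nn.Module):
    """Illustrative projector mapping frozen modality-encoder features into the LLM space."""
    def __init__(self, enc_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, x_features):
        # Map 'X' (image/video/audio) encoder features into the LLM's embedding space,
        # so they can be prepended to the caption's token embeddings.
        return self.proj(x_features)

projector = InputProjector()
x_features = torch.randn(2, 8, 1024)       # e.g. patch features from a frozen encoder
llm_inputs = projector(x_features)         # (2, 8, 4096), fed to the frozen LLM
# Only the projector is trained, supervised by generating the paired caption,
# which is the signal that the text-'X' pair data provides in this stage.
```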

In our implementation, we utilize the stable-diffusion-v1.5 model for image generation, whose text encoder has a hidden size of 768. However, in the stable-diffusion-2 model, the text...
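As a quick sanity check, the text-encoder hidden size of each release can be read from its config (the Hugging Face repo ids below are the usual ones; swap in your own checkpoints if needed):

```python
from transformers import CLIPTextConfig

sd15 = CLIPTextConfig.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="text_encoder")
sd2 = CLIPTextConfig.from_pretrained("stabilityai/stable-diffusion-2-1", subfolder="text_encoder")

print(sd15.hidden_size)   # 768  (CLIP ViT-L/14)
print(sd2.hidden_size)    # 1024 (OpenCLIP ViT-H/14)
```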

Hi, thanks for your interest. While we plan to release the data, we have been advised to be cautious about copyright issues. At the moment, we are in the process...

@NikhilBhargav @oroojlooy Thanks for your interest. We've already released the textual MosIT dataset. Please refer to [MosIT](data/IT_data/MosIT_data). The image, video, and audio components will be rolled out gradually. Stay tuned!