Shengqiong Wu

Results: 31 comments of Shengqiong Wu

@pengxuan001, actually, the weights produced by the previous training stage are loaded and used during the next stage of training: https://github.com/NExT-GPT/NExT-GPT/blob/e2e2f9477a110403f6b4e719ebb868c9f44f7a1b/code/model/agent.py#L17 If you want to save the weights trained in different stages separately,...

This is because we resize the embedding layer of the LLM after adding new tokens, so the parameters of the embedding layer become trainable.
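For reference, a minimal sketch of that step with Hugging Face transformers (the model name and the signal-token strings below are placeholders, not necessarily the ones used in this repo):

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "lmsys/vicuna-7b-v1.5"                         # placeholder base LLM
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

# Add the new signal tokens (names are illustrative).
tokenizer.add_tokens(["[IMG0]", "[IMG1]", "[VID0]", "[AUD0]"], special_tokens=True)

# Resizing allocates a larger embedding matrix; its parameters are trainable by default.
model.resize_token_embeddings(len(tokenizer))

# To update only the embeddings, everything else has to be frozen explicitly.
for p in model.parameters():
    p.requires_grad = False
model.get_input_embeddings().weight.requires_grad = True
model.get_output_embeddings().weight.requires_grad = True
```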

Sorry, I do not understand your question. If you are referring to the representations of the signal tokens, they are extracted from the final hidden layer of the LLM.
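If it helps, here is a rough, self-contained sketch of how such representations can be pulled out with transformers/PyTorch (model name, token names, and prompt are illustrative, not the repo's actual setup):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

name = "lmsys/vicuna-7b-v1.5"                         # placeholder base LLM
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
tokenizer.add_tokens(["[IMG0]", "[IMG1]"], special_tokens=True)   # illustrative signal tokens
model.resize_token_embeddings(len(tokenizer))

enc = tokenizer("a photo of a dog [IMG0] [IMG1]", return_tensors="pt")
with torch.no_grad():
    out = model(**enc, output_hidden_states=True, return_dict=True)
last_hidden = out.hidden_states[-1]                   # final hidden layer: (1, seq_len, hidden)

# Keep only the hidden states at the signal-token positions.
signal_ids = tokenizer.convert_tokens_to_ids(["[IMG0]", "[IMG1]"])
mask = torch.isin(enc["input_ids"], torch.tensor(signal_ids))
signal_reprs = last_hidden[mask]                      # (num_signal_tokens, hidden)
```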

@pengxuan001, we have released the new version of the code, including all training and tuning procedures. Please refer to it.

Thank you for your interest and for bringing up this issue. This is indeed a bug, but fortunately it has only a minor influence. We actually updated the code three weeks...

I think this would be a challenge because the model structures of CogView2, CogVideo, and the diffusion-based models are different. Diffusion-based models employ a separate text encoder, which extracts textual...

In the first stage, we focus only on the encoding-side alignment. Specifically, only the text-'X' pair data are utilized. As for the second stage, the embeddings generated by the text...
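To make the first stage concrete, here is a minimal conceptual sketch of encoding-side alignment; the dimensions and module names are illustrative, not the repository's actual implementation:

```python
import torch
import torch.nn as nn

class InputProjector(nn.Module):
    """Illustrative projector mapping frozen modality-encoder features into the LLM space."""
    def __init__(self, enc_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, x_features):
        # Map 'X' (image/video/audio) encoder features into the LLM's embedding space,
        # so they can be prepended to the caption's token embeddings.
        return self.proj(x_features)

projector = InputProjector()
x_features = torch.randn(2, 8, 1024)       # e.g. patch features from a frozen encoder
llm_inputs = projector(x_features)         # (2, 8, 4096), fed to the frozen LLM
# Only the projector is trained, supervised by generating the paired caption,
# which is the signal that the text-'X' pair data provides in this stage.
```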

In our implementation, we utilize the stable-diffusion-v1.5 model for image generation, whose text encoder has a hidden size of 768. However, in the stable-diffusion-2 model, the text...
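As a quick sanity check, the text-encoder hidden size of each release can be read from its config (the Hugging Face repo ids below are the usual ones; swap in your own checkpoints if needed):

```python
from transformers import CLIPTextConfig

sd15 = CLIPTextConfig.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="text_encoder")
sd2 = CLIPTextConfig.from_pretrained("stabilityai/stable-diffusion-2-1", subfolder="text_encoder")

print(sd15.hidden_size)   # 768  (CLIP ViT-L/14)
print(sd2.hidden_size)    # 1024 (OpenCLIP ViT-H/14)
```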

Hi, thanks for your interest. While we plan to release the data, we have been advised to be cautious about copyright issues. At the moment, we are in the process...

@NikhilBhargav @oroojlooy Thanks for your interest. We've already released the textual MosIT dataset. Please refer to [MosIT](data/IT_data/MosIT_data). The image, video, and audio components will be rolled out gradually. Stay tuned!