
Questions about Model

RitchieAlpha opened this issue · 1 comment

Dear Author,

I would like to express my sincere gratitude for your open-source contributions. Your neural network model has left a deep impression on me. It seems that your model is driven by text information (CLIP aligns images with text, while Whisper aligns audio with text), and the model ultimately appears to be aimed at multimodal QA and multimodal captioning. However, I have the following questions:

  1. The dimensions of the different modalities are vastly different. How does your network balance the information coming from each modality?
  2. In real-world scenarios, some modalities may be missing. Does your model require input from all three modalities during training/inference, or can only some modalities be provided?
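For context, question 1 presumably concerns something like the following: projecting each encoder's output into the LLM's shared embedding dimension before fusion. This is only a minimal illustrative sketch — the dimensions, projection scheme, and names here are hypothetical assumptions, not taken from Macaw-LLM's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical encoder output sizes (assumptions, not Macaw-LLM's values).
CLIP_DIM, WHISPER_DIM, LLM_DIM = 768, 512, 4096

# One learned linear projection per modality maps that encoder's
# features into the LLM's shared embedding dimension.
proj_image = rng.standard_normal((CLIP_DIM, LLM_DIM)) * 0.02
proj_audio = rng.standard_normal((WHISPER_DIM, LLM_DIM)) * 0.02

def embed(features, projection):
    """Project modality features of shape (seq_len, feat_dim) into LLM space."""
    return features @ projection

image_feats = rng.standard_normal((49, CLIP_DIM))      # e.g. 7x7 CLIP patch grid
audio_feats = rng.standard_normal((100, WHISPER_DIM))  # e.g. Whisper frames

# After projection, both modalities live in the same space and can be
# concatenated (with text token embeddings) along the sequence axis.
tokens = np.concatenate([embed(image_feats, proj_image),
                         embed(audio_feats, proj_audio)], axis=0)
print(tokens.shape)  # (149, 4096)
```

Under this scheme, "balancing" reduces to letting each per-modality projection learn its own scale, since all modalities end up as tokens of the same width in the LLM's input sequence.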

I am looking forward to your work and hope to see your article soon. Thank you.

Best regards, RitchieAlpha

RitchieAlpha · May 30 '23 01:05