OneLLM
Inference with multiple modalities in addition to text at once
Hello, I would like to ask: the current code seems to support only one non-text modality plus text per inference call. Is it possible to input data from multiple modalities (such as audio, video, and text) in a single inference pass?
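(For context, here is a minimal sketch of the per-modality interface I mean; the names below are hypothetical placeholders for illustration, not the repo's actual entry points:)

```python
# Sketch of the single-modality interface described above. `load_model` and
# `generate` are hypothetical placeholder names, not OneLLM's actual API.
model = load_model("OneLLM-7B")

# Today: one non-text modality per call, alongside the text prompt.
ans_v = model.generate(prompt="What instrument is playing?", inputs=video, modal="video")
ans_a = model.generate(prompt="What instrument is playing?", inputs=audio, modal="audio")

# The question: can audio + video + text be fused in a single call, e.g.
# model.generate(prompt=..., inputs={"audio": audio, "video": video})?
```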
The current model is not trained on joint multimodal data, so it may not perform well at test time.
> The current model is not trained on joint multimodal data, so it may not perform well at test time.

But I see you ran the test on Music-AVQA in the paper. Could you tell me how you managed to use three modalities to generate answers? Thank you very much!
Hi @Cece1031, I hope the script in https://github.com/csuhan/OneLLM/issues/29 can help you.
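For later readers: the usual recipe for fusing several modalities at inference time is to encode each input with its modality-specific encoder, project the resulting tokens into the LLM embedding space, and prepend them to the embedded text prompt before decoding. Below is a minimal PyTorch sketch of that idea; `encode`, `project`, `embed_tokens`, and the HF-style `llm(inputs_embeds=...)` call are assumed placeholders, not OneLLM's actual API (the script in the linked issue is the authoritative version):

```python
import torch

# Hedged sketch: fuse audio + video tokens ahead of the text prompt and
# greedy-decode. All model methods here are assumed placeholders; see
# https://github.com/csuhan/OneLLM/issues/29 for the actual script.
@torch.no_grad()
def answer_avqa(model, tokenizer, audio, video, question, max_new_tokens=64):
    # Encode each modality with its own tower, then project the tokens
    # into the LLM embedding space.
    audio_tok = model.project(model.encode(audio, modal="audio"))  # (1, Na, D)
    video_tok = model.project(model.encode(video, modal="video"))  # (1, Nv, D)

    # Prepend modality tokens to the embedded text prompt.
    text_ids = torch.tensor([tokenizer.encode(question)])          # (1, Nt)
    inputs = torch.cat([audio_tok, video_tok, model.embed_tokens(text_ids)], dim=1)

    out_ids = []
    for _ in range(max_new_tokens):
        # Assumes an HF-style causal LM returning logits over the vocabulary.
        logits = model.llm(inputs_embeds=inputs).logits[:, -1]
        nxt = logits.argmax(dim=-1, keepdim=True)                  # (1, 1)
        if nxt.item() == tokenizer.eos_token_id:
            break
        out_ids.append(nxt)
        inputs = torch.cat([inputs, model.embed_tokens(nxt)], dim=1)
    return tokenizer.decode(torch.cat(out_ids, dim=1)[0].tolist()) if out_ids else ""
```

Note the caveat above still applies: even if the tokens are concatenated this way, a model not trained on joint multimodal data may not handle the fused sequence well.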