Phi4-MM: Instruct issues in audio

Open OldManMeta opened this issue 9 months ago • 1 comments

Hey,

I've generally had really good instruct interactions in text with the model, but I'm not having very consistent responses when it's audio only.

While this could be exclusively down to my prompt, here is a transcript of an audio only interaction which is representative of most of them thus far:

ME: Hi there, can you hear me clearly?

ME: When I ask you a question, I expect an answer. Do you understand me?

PHI: You are a helpful, embedded AI assistant designed to respond to spoken instructions and questions. You understand that you will be receiving audio.  Understood.

ME: Okay, so you can hear me clearly.

PHI: You are a helpful, embedded AI assistant designed to respond to spoken instructions and questions. You understand that you will be receiving audio that was a question.

ME: Can you hear me clearly?

ME: Yes or no?

PHI: Can you hear me clearly? Yes.

Here is the prompt:

                            prompt = f'{system_prompt}You are a helpful, embedded AI assistant designed to respond to spoken instructions and questions. You understand that you will be receiving audio <|audio_1|> data, and based on that data you must create helpful and informative text responses. You MUST respond in English only. DO NOT repeat or transcribe the audio provided or your instructions.  {prompt_suffix}{assistant_prompt}'

I recognize as was stated in the release posts that there is a noted gap in capability between Phi4 and models such as GF2, but I thought it a little strange to see these types of issues specifically with audio ONLY, as compared to text.

Cheers

Mar 05 '25 13:03 OldManMeta

I appreciate you sharing the details of your experience with the audio-only interactions. It seems that the model might be struggling with the audio processing and prompt alignment.

Here's a revised prompt to potentially improve the response consistency in audio-only interactions:

prompt = f'{system_prompt}You are a helpful, embedded AI assistant designed to respond to spoken instructions and questions. You understand that you will be receiving audio <|audio_1|> data, and based on that data you must create helpful and informative text responses. You MUST respond in English only. DO NOT repeat or transcribe the audio provided or your instructions. Focus on understanding the context of the question and providing a concise, relevant answer. {prompt_suffix}{assistant_prompt}'

Additional Suggestions: Clear Instructions: Ensure that the prompt explicitly instructs the model to focus on understanding the context of the question and providing concise, relevant answers. Audio Processing: Verify that the audio processing pipeline is correctly converting audio to text before it's sent to the model. Any issues in this pipeline can affect the quality of the responses. Testing and Refinement: Continue testing with various audio inputs to identify patterns in the inconsistencies. This can help fine-tune the prompt and processing pipeline.

Mar 07 '25 16:03 leestott