generative-ai-js Gemini 1.5 Flash 002 Hallucinates Timestamps when transcribing audio

Description of the bug:

The new flash model completely hallucinates timestamps when performing transcription.

Actual vs expected behavior:

The timestamps should be accurate based on when that word or phrase was spoken. The original flash model is excellent at this. The new model completely hallucinates.

Any other information you'd like to share?

Just try it. It's so far off that it becomes obvious the second you do.

Sep 30 '24 15:09 JamMaster1999

Hi @JamMaster1999 ,

I have escalated this to the internal team.

Oct 01 '24 06:10 gmKeshari

Thank you

Oct 13 '24 02:10 JamMaster1999

I have tested gemini-1.5-flash, gemini-1.5-flash-002, gemini-1.5-pro, and gemini-1.5-pro-002. Among these, gemini-1.5-pro-002 performed the best for timestamping, but it was still off by a few seconds. The other models were significantly inaccurate.

I was hoping the audio_timestamp parameter in GenerationConfig would improve the accuracy, but it seems not.


from vertexai.generative_models import GenerativeModel

# Request JSON output so the transcript can be parsed directly.
vertexai_json_client = GenerativeModel(
    "gemini-1.5-pro-002",
    generation_config={"response_mime_type": "application/json"},
)

# The example schema is a plain JSON array (no wrapping braces, no trailing comma).
prompt = """
    Transcribe this audio file, in the format of timestamp and caption.
    Ignore the background music and only transcribe the spoken words.
    Mark timestamp at every sentence or phrase.
    Use this JSON schema:
        [
          {
            "timestamp": "00:00:00",
            "caption": "spoken words"
          }
        ]
    """

Oct 20 '24 21:10 APPXOTICA

I have spent a significant amount of time testing these models as well and the original flash is flawless if you prompt it correctly. You might want to try the following:

Use only minutes and seconds (00:00) in the timestamp format.
Ask it to generate timestamps at the sentence or phrase level.
Chunk the audio into 5-minute sections. Beyond 5 minutes, the timestamps start to become less accurate.
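
The chunking advice above could be sketched roughly as follows. This is only an illustration, not the code I run in production: the helper names `chunk_ranges` and `shift_timestamp` are made up for this example, and cutting the audio itself (with whatever audio tool you prefer) is left out. The idea is to transcribe each chunk separately and then shift the per-chunk MM:SS timestamps by the chunk's start offset.

```python
def chunk_ranges(total_s: int, chunk_s: int = 300) -> list[tuple[int, int]]:
    """Return (start, end) second offsets covering the audio in chunk_s-sized pieces."""
    return [
        (start, min(start + chunk_s, total_s))
        for start in range(0, total_s, chunk_s)
    ]


def shift_timestamp(stamp: str, offset_s: int) -> str:
    """Shift a chunk-relative "MM:SS" timestamp by the chunk's start offset."""
    minutes, seconds = (int(part) for part in stamp.split(":"))
    total = minutes * 60 + seconds + offset_s
    return f"{total // 60:02d}:{total % 60:02d}"


# A 12-minute file split into 5-minute sections (hypothetical length).
for start, end in chunk_ranges(12 * 60):
    print(start, end)

# A caption at 01:15 inside the second chunk, mapped back to the full file.
print(shift_timestamp("01:15", 300))
```

Minutes are allowed to run past 59 here, which keeps the MM:SS format consistent for longer files.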

I have this in production with hundreds of videos tested and the Gemini 1.5 Flash works extremely accurately. With that said, I have recently moved over to Whisper Turbo. It is more reliable overall, though not as accurate/cheap as Flash.

Oct 20 '24 23:10 JamMaster1999

@JamMaster1999 I edited my prompt to only include minutes and seconds in the timestamps. I've tested it with videos under 3 minutes, but the timestamps generated by gemini-1.5-flash are off by more than 10 seconds, and sometimes even by a few minutes. Could you please share the prompt you're using?
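
For comparing results, here is a rough sketch of how the offset could be quantified, assuming a short clip with hand-labelled timestamps to compare against. The function names and the sample values are hypothetical and are not measurements from my tests.

```python
def to_seconds(stamp: str) -> int:
    """Convert "MM:SS" or "HH:MM:SS" to whole seconds."""
    seconds = 0
    for part in stamp.strip().split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def timestamp_errors(predicted: list[str], reference: list[str]) -> dict[str, float]:
    """Mean and max absolute error, in seconds, between paired timestamps."""
    if not predicted or len(predicted) != len(reference):
        raise ValueError("predicted and reference must be non-empty and the same length")
    errors = [
        abs(to_seconds(p) - to_seconds(r))
        for p, r in zip(predicted, reference)
    ]
    return {"mean": sum(errors) / len(errors), "max": float(max(errors))}


# Hypothetical values for illustration only.
model_output = ["00:04", "00:19", "01:02"]
hand_labelled = ["00:03", "00:12", "00:45"]
print(timestamp_errors(model_output, hand_labelled))
```

This pairs captions by position, so it only makes sense when the model and the reference split the audio into the same sentences or phrases.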

Oct 22 '24 03:10 APPXOTICA

Hello, has there been any progress on this issue? It seems like a useful feature, but most of the models don't return correct timestamps at all.

Dec 12 '24 13:12 thhung