
What are the exact audio format requirements?

Open JourneyToSilius opened this issue 2 years ago • 7 comments

Could you explain what the exact audio file requirements are? I don't understand why it works when I convert MP3 to WAV but not when I convert PCM or OGG to WAV (all the audio comes from the same source, Amazon Polly).

It would be great to use straight WAV without the need to convert it, to preserve quality

Thanks

JourneyToSilius avatar May 17 '22 01:05 JourneyToSilius

> I don't understand why it works when I convert MP3 to WAV but not when I convert PCM or OGG to WAV (all the audio comes from the same source, Amazon Polly).

We don't understand either. You'd better share the files if you need help.

> It would be great to use straight WAV without the need to convert it, to preserve quality

Nothing stops you here

nshmyrev avatar May 17 '22 08:05 nshmyrev

This is the file in MP3 format. If I convert it to WAV, Vosk processes the speech correctly: http://sndup.net/88qg

MP3 format metadata:

encoding | mp3
-- | --
format | fltp
number_of_channel | 1 (mono)
sample_rate | 24000
file_size | 20061 byte
duration | 3.336s

MP3 converted to WAV:
http://sndup.net/fwpp

WAV metadata :


encoding | pcm_s16le
-- | --
format | s16
number_of_channel | 1 (mono)
sample_rate | 24000
file_size | 160172 byte
duration | 3.336s

This is another file obtained with the same procedure, but in OGG format. If I convert this file to WAV, my Vosk code doesn't return any result: http://sndup.net/v5zh

OGG format metadata:


encoding | vorbis
-- | --
format | fltp
number_of_channel | 1 (mono)
sample_rate | 24000
file_size | 21215 byte
duration | 3.28s

OGG converted to WAV : http://sndup.net/nsvf

WAV format metadata:

encoding | pcm_s32le
-- | --
format | s32
number_of_channel | 1 (mono)
sample_rate | 24000
file_size | 314924 byte
duration | 3.28s

I have also tested raw PCM but I can't share PCM through this service

            if wf.getnchannels() != 1 or wf.getsampwidth() != 2 or wf.getcomptype() != "NONE":
                print ("Audio file must be WAV format mono PCM.")
                exit (1)

I added this audio check from your test scripts, but it passes every time, so I don't really know what's wrong with the PCM and OGG files. Maybe it's the conversion procedure?

Thank you very much

JourneyToSilius avatar May 17 '22 14:05 JourneyToSilius

Hi. The second file has a sample size of 32 bits. We need 16 bits:

format | s32

To force conversion to 16-bit, use the following command:

ffmpeg -i <input_file> -ar 16000 -ac 1 -acodec pcm_s16le file.wav

After that it will work fine.
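
Where ffmpeg is not available, the bit-depth part of that fix could be sketched in plain Python roughly as follows. This is not the maintainers' method, only an illustration: `convert_s32_to_s16` is a hypothetical helper name, it narrows 32-bit PCM samples to 16-bit by keeping the top 16 bits, and unlike the ffmpeg command above it does not resample to 16 kHz or downmix channels. It also assumes the input can be opened by Python's `wave` module, which, depending on the Python version, may reject files with an extensible (WAVEX) header.

```python
import array
import sys
import wave


def convert_s32_to_s16(src_path, dst_path):
    """Rewrite a 32-bit PCM WAV file as 16-bit PCM, keeping rate and channels."""
    with wave.open(src_path, "rb") as src:
        if src.getsampwidth() != 4:
            raise ValueError("expected 4-byte samples, got %d" % src.getsampwidth())
        nchannels = src.getnchannels()
        framerate = src.getframerate()
        raw = src.readframes(src.getnframes())

    samples = array.array("i")
    if samples.itemsize != 4:
        # Assumption of this sketch: the platform's C int is 4 bytes wide.
        raise RuntimeError("this sketch needs a 4-byte C int")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Keep the 16 most significant bits of every sample (no dithering).
    narrowed = array.array("h", (s >> 16 for s in samples))
    if sys.byteorder == "big":
        narrowed.byteswap()

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(nchannels)
        dst.setsampwidth(2)
        dst.setframerate(framerate)
        dst.writeframes(narrowed.tobytes())
```

The output keeps the original sample rate (24000 Hz for the files in this thread), so the recognizer would need to be given that rate, as it was for the MP3-derived WAV that already worked.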

nshmyrev avatar May 24 '22 21:05 nshmyrev

thanks :) I kind of thought about this when I posted it here, but I went to do something else instead and wanted to fix it later. Thank you for your confirmation

JourneyToSilius avatar May 24 '22 21:05 JourneyToSilius

Great. We might need a method to check the file in the API one day.
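
Until something like that exists, a check along these lines could be run on the caller's side. It is only a sketch built on Python's standard `wave` module and the three conditions quoted earlier in this thread (mono, 16-bit, uncompressed PCM); `check_wav_for_vosk` is a hypothetical name and not part of the Vosk API.

```python
import wave


def check_wav_for_vosk(path):
    """Return a list of problems; an empty list means the header looks usable."""
    try:
        wf = wave.open(path, "rb")
    except (wave.Error, EOFError) as exc:
        # Raised for non-WAV input or formats the wave module cannot read.
        return ["not a WAV file the wave module can read: %s" % exc]

    problems = []
    with wf:
        if wf.getnchannels() != 1:
            problems.append("expected 1 channel, found %d" % wf.getnchannels())
        if wf.getsampwidth() != 2:
            problems.append(
                "expected 16-bit samples, found %d-bit" % (wf.getsampwidth() * 8)
            )
        if wf.getcomptype() != "NONE":
            problems.append("expected uncompressed PCM, found %s" % wf.getcomptype())
    return problems
```

Returning every problem found, instead of exiting on the first one, makes it easier to see at a glance that a file like the OGG-derived WAV above is mono but 32-bit.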

nshmyrev avatar May 24 '22 21:05 nshmyrev

That'd be great, but I think it would be enough for now to add it to the documentation. That way you can avoid people complaining about it. Anyway, thanks for the effort!

JourneyToSilius avatar May 24 '22 22:05 JourneyToSilius

The format should be mono WAV, not WAVEX, if a built-in decoder is used; ffmpeg accepts any format. WAVEX is an extensible format: it has a different file header but the usual .wav extension.
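
One way to tell the two header types apart is to read the format tag stored at the start of the `fmt ` chunk: plain PCM uses tag `0x0001`, while the extensible layout uses `0xFFFE`. The sketch below does that with the standard library only; `read_wav_format_tag` is a hypothetical helper name, and the code reads nothing beyond the tag itself.

```python
import struct

WAVE_FORMAT_PCM = 0x0001
WAVE_FORMAT_EXTENSIBLE = 0xFFFE


def read_wav_format_tag(path):
    """Return the 16-bit format tag from the 'fmt ' chunk of a RIFF/WAVE file."""
    with open(path, "rb") as f:
        head = f.read(12)
        if len(head) < 12 or head[:4] != b"RIFF" or head[8:12] != b"WAVE":
            raise ValueError("not a RIFF/WAVE file")
        while True:
            chunk_header = f.read(8)
            if len(chunk_header) < 8:
                raise ValueError("no 'fmt ' chunk found")
            chunk_id, chunk_size = struct.unpack("<4sI", chunk_header)
            if chunk_id == b"fmt ":
                tag_bytes = f.read(2)
                if len(tag_bytes) < 2:
                    raise ValueError("truncated 'fmt ' chunk")
                return struct.unpack("<H", tag_bytes)[0]
            # Skip other chunks; they are padded to an even number of bytes.
            f.seek(chunk_size + (chunk_size & 1), 1)
```

A file whose tag equals `WAVE_FORMAT_EXTENSIBLE` would be a candidate for re-encoding with the ffmpeg command given earlier in the thread.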

ghost avatar May 24 '22 22:05 ghost