Maximise audio quality - conversion workflow

Open daT4v1s opened this issue 5 years ago • 5 comments

I'm not that familiar with the code. It seems to do some wav/flac conversion work by breaking the audio into pieces and then uploading them, but I'm not sure of the order or the details. Can someone explain it? (Flowcharts are fine.)

It seems to down-sample the audio. I'm not sure what bit depths/sample rates are allowed, but the output is e.g. 16 kHz, 16-bit integer.

If the input is mp3, prefer decoding with -c:a mp3float; otherwise it is decoded to 16-bit integer, which makes a quality difference (something to do with the frequency-domain encoding).

Maybe maximise the use of the limited 16-bit dynamic range and optimise the headroom, via the following steps (if the audio is not already in the "native upload format"):

  1. oversample (maybe excessive?) to 192 kHz or 384 kHz using max-quality settings
  2. apply some dynamic normalization
  3. downsample + dither

-filter:a aresample=384000:resampler=soxr:precision=33:osf=dbl:cutoff=0.98:osf=dbl,dynaudnorm=g=63:b=1:c=1,aresample=44100:resampler=soxr:precision=33:cutoff=0.91:osf=flt

References: FFmpeg Resampler Documentation (soxr is better than ffmpeg's default); Dynamic Audio Normalizer
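For readers who want to try the three steps on their own files, here is a minimal Python sketch that assembles the filter chain quoted above (with the repeated `osf=dbl` dropped) into an ffmpeg command line. The function names and file names are hypothetical, the chain is the poster's proposal rather than anything autosub runs itself, and the soxr resampler is only available when ffmpeg was built with libsoxr.

```python
import shlex


def build_filter_chain(oversample_hz=384000, target_hz=44100):
    """Return an ffmpeg audio filter string: oversample, normalize, downsample."""
    # Step 1: oversample with the soxr resampler at high precision.
    up = f"aresample={oversample_hz}:resampler=soxr:precision=33:osf=dbl:cutoff=0.98"
    # Step 2: dynamic normalization (values copied from the post).
    norm = "dynaudnorm=g=63:b=1:c=1"
    # Step 3: downsample to the target rate.
    down = f"aresample={target_hz}:resampler=soxr:precision=33:cutoff=0.91:osf=flt"
    return ",".join([up, norm, down])


def build_ffmpeg_argv(src, dst, filter_chain):
    """Build the argument list; nothing is executed here."""
    return ["ffmpeg", "-hide_banner", "-y", "-i", src,
            "-vn", "-filter:a", filter_chain, dst]


if __name__ == "__main__":
    # Hypothetical file names, printed for inspection only.
    argv = build_ffmpeg_argv("input.mp3", "output.flac", build_filter_chain())
    print(shlex.join(argv))
```

Note that the quoted chain does not itself request dithering. libswresample exposes a `dither_method` option on `aresample`, which could be added to the last stage when the output is an integer format.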

The reason for suggesting this: the audio quality might slightly change the recognised words (accuracy issues?). I haven't studied how sensitive the recognition is to it.

If the pipeline is audio → flac → wav → upload, and the down-sampling to 16(?)-bit / 16(?) kHz happens at the flac → wav stage, then when the source is mp3 (via float decode), aac or opus/ogg, it is decoded as 32-bit float, so save the flac as 24-bit to preserve dynamic range.
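To put rough numbers on the 16-bit versus 24-bit argument, the sketch below uses the textbook figure for an ideal N-bit quantizer driven by a full-scale sine, about 6.02·N + 1.76 dB. The function names are hypothetical, and real codecs and converters will not reach these ideal values.

```python
import math


def ideal_dynamic_range_db(bits):
    """Theoretical SNR of an ideal N-bit quantizer for a full-scale sine wave."""
    return 20 * math.log10(2 ** bits) + 10 * math.log10(1.5)


def bits_lost_to_headroom(headroom_db):
    """Approximate resolution (in bits) left unused by peaks below full scale."""
    return headroom_db / (20 * math.log10(2))


if __name__ == "__main__":
    for bits in (16, 24):
        print(f"{bits}-bit: {ideal_dynamic_range_db(bits):.1f} dB")
    # Example: peaks sitting 12 dB below full scale.
    print(f"12 dB headroom: about {bits_lost_to_headroom(12):.1f} bits unused")
```

This is why normalizing before the final 16-bit conversion matters more than it would at 24 bits: every 6 dB of unused headroom costs roughly one bit of resolution.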

daT4v1s avatar Jun 28 '19 05:06 daT4v1s

A flowchart may be too complicated, but let me explain. It isn't really necessary to ask for higher audio quality, because the API itself may not need audio clips of that quality. If you aren't familiar with the speech-to-text API used by this software, see #111. Of course, what you describe may well influence the audio quality; I hadn't realized that before. You can refactor the code to get a much better audio processing workflow and then open a pull request.

BingLingGroup avatar Jul 08 '19 03:07 BingLingGroup

I've (partially) fixed this problem in my repo. The conversion is now separated: .wav (48kHz/16bit/mono) for finding regions and .flac (44.1kHz/24bit/mono) for the speech API. Details are in CHANGELOG.md. @daT4v1s
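As an illustration of the two targets described above, the following sketch builds one ffmpeg command per target. These are not the commands the repo actually uses (see CHANGELOG.md for those); the helper name and file names are hypothetical. It also assumes that ffmpeg's FLAC encoder writes 24-bit output when given `s32` samples, which is worth confirming with ffprobe on your own build.

```python
def conversion_argv(src, dst, sample_rate, sample_fmt):
    """Build an ffmpeg argument list for a mono conversion; nothing is executed."""
    return ["ffmpeg", "-hide_banner", "-y", "-i", src,
            "-vn",                      # drop any video stream
            "-ac", "1",                 # mono
            "-ar", str(sample_rate),    # target sample rate
            "-sample_fmt", sample_fmt,  # s16 for 16-bit, s32 for FLAC 24-bit (assumed)
            dst]


if __name__ == "__main__":
    # 48kHz/16bit/mono wav for finding speech regions.
    print(conversion_argv("input.mp4", "regions.wav", 48000, "s16"))
    # 44.1kHz/24bit/mono flac for the speech API.
    print(conversion_argv("input.mp4", "speech.flac", 44100, "s32"))
```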

BingLingGroup avatar Jul 13 '19 04:07 BingLingGroup

I just committed a feature that pre-processes the audio using this workflow, but controlled by autosub itself. See issue #40.

The default pre-process commands need ffmpeg-normalize. Of course you can write your own by using the -apc input option. But remember to set the pre-processing output format to 44.1kHz/24bit/mono flac. Currently I haven't written the logic to check the output format; the output is used directly by the speech-to-text method. And when that method cuts the clips, it uses the copy arg, so it is very risky if your format isn't proper.
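Since the output format is not checked automatically, one option is to verify a pre-processed file yourself before handing it over. The sketch below is not part of autosub; it is a hypothetical helper that asks ffprobe for the first audio stream's properties and compares them with the 44.1kHz/24bit/mono flac target mentioned above.

```python
import json
import subprocess

# Target format taken from the comment above.
EXPECTED = {"codec_name": "flac", "sample_rate": 44100, "channels": 1, "bits": 24}


def probe_audio(path):
    """Return ffprobe's JSON description of the first audio stream in path."""
    cmd = ["ffprobe", "-v", "error", "-select_streams", "a:0",
           "-show_entries",
           "stream=codec_name,sample_rate,channels,bits_per_raw_sample",
           "-of", "json", path]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def _to_int(value):
    """ffprobe reports some numbers as strings; missing or odd values become 0."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


def format_problems(ffprobe_json, expected=EXPECTED):
    """Return a list of mismatches; an empty list means the format looks right."""
    streams = json.loads(ffprobe_json).get("streams", [])
    if not streams:
        return ["no audio stream found"]
    stream = streams[0]
    actual = {
        "codec_name": stream.get("codec_name"),
        "sample_rate": _to_int(stream.get("sample_rate")),
        "channels": _to_int(stream.get("channels")),
        "bits": _to_int(stream.get("bits_per_raw_sample")),
    }
    return [f"{key}: expected {want}, got {actual[key]}"
            for key, want in expected.items() if actual[key] != want]
```

A typical use would be `format_problems(probe_audio("preprocessed.flac"))`, where the file name is a placeholder, refusing to continue if the returned list is non-empty.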

My repo. You can install it using pip, or wait for me to release. I've written quite a few features now, and I think I will release in a few more days.

BingLingGroup avatar Jul 20 '19 14:07 BingLingGroup

I've already released the standalone version. Click here and download.

BingLingGroup avatar Jul 30 '19 12:07 BingLingGroup

Also, if you are not satisfied with the current conversion command, you can replace it manually by using -acc/--audio-conversion-cmd.

Apart from that, you can also do the conversion outside autosub. Pass -ap n to override the built-in conversion.
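A rough sketch of the second route, converting outside autosub and then passing `-ap n`, might look like this. Only `-ap n` comes from this thread; the `-i` input flag, the helper name and the file name are assumptions, and other options a real run needs (such as the speech language) are left out, so check the readme before relying on it.

```python
import subprocess


def autosub_argv(audio_path, extra=()):
    """Build an autosub command line for an already-converted audio file."""
    # "-ap n" is quoted from the thread; "-i" is an assumed input flag.
    return ["autosub", "-i", audio_path, "-ap", "n", *extra]


def run_autosub(audio_path, extra=(), dry_run=True):
    """Print the command, and run it only when dry_run is False."""
    argv = autosub_argv(audio_path, extra)
    print(" ".join(argv))
    if not dry_run:
        subprocess.run(argv, check=True)
    return argv


if __name__ == "__main__":
    # Hypothetical file converted beforehand with your own ffmpeg command.
    run_autosub("speech.flac")
```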

More info in my repo's readme.

BingLingGroup avatar Aug 06 '19 04:08 BingLingGroup