autosub
Maximise audio quality - conversion workflow
I'm not deep into the code, but it seems to do some WAV/FLAC conversion work by breaking the audio into pieces and then uploading them. I'm not sure about the order or the details; can someone explain it? (A flowchart is fine.)
It also seems to down-sample the audio. I'm not sure what bit depths/sample rates are allowed, but the output is, for example, 16 kHz, 16-bit integer.
If the source is MP3, prefer the float decoder:
-c:a mp3float
Otherwise it decodes straight to 16-bit integer, which makes a quality difference (something to do with the frequency encoding).
Maybe also maximise the use of the limited 16-bit dynamic range and optimise headroom, via the following (if the source is not already in the native upload format):
- oversample (maybe excessive) to 192 kHz or 384 kHz using maximum-quality settings
- apply dynamic normalization
- downsample and dither
-filter:a aresample=384000:resampler=soxr:precision=33:osf=dbl:cutoff=0.98,dynaudnorm=g=63:b=1:c=1,aresample=44100:resampler=soxr:precision=33:cutoff=0.91:osf=flt
(The original chain had osf=dbl specified twice in the first aresample; it only needs to appear once.)
FFmpeg Resampler Documentation (soxr gives better quality than FFmpeg's default resampler)
Dynamic Audio Normalizer
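The three-stage chain above (oversample, normalize, downsample+dither) can be sketched as an ffmpeg command assembled in Python. This is a minimal sketch: the filter settings are copied from the comment above, while the function name and file paths are placeholders, not part of autosub.

```python
# Sketch: build the suggested oversample -> dynaudnorm -> downsample
# ffmpeg command line. Filter parameters are taken from the thread.
UPSAMPLE = "aresample=384000:resampler=soxr:precision=33:osf=dbl:cutoff=0.98"
NORMALIZE = "dynaudnorm=g=63:b=1:c=1"
DOWNSAMPLE = "aresample=44100:resampler=soxr:precision=33:cutoff=0.91:osf=flt"

def build_cmd(src: str, dst: str) -> list[str]:
    """Return an ffmpeg argv applying the three-stage filter chain."""
    filter_chain = ",".join([UPSAMPLE, NORMALIZE, DOWNSAMPLE])
    return ["ffmpeg", "-i", src, "-filter:a", filter_chain, dst]

print(build_cmd("in.mp3", "out.flac"))
```

Building the argv as a list (rather than one shell string) avoids quoting problems when paths contain spaces.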
One caveat: the normalization might slightly change the recognized word output (accuracy issues); I haven't studied the API's sensitivity to it.
If the flow is audio → FLAC → WAV → upload, and the down-sampling to 16-bit / 16 kHz occurs at the FLAC-to-WAV stage:
when the source is MP3 (decoded via mp3float), AAC, or Opus/Ogg, it is decoded as 32-bit float, so save the FLAC as 24-bit to preserve dynamic range.
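The lossy-source path described above can be sketched as follows. Assumptions worth flagging: mp3float is ffmpeg's float MP3 decoder and must be selected before -i; passing -sample_fmt s32 to ffmpeg's FLAC encoder is one way to get more than 16 bits per sample, since the encoder stores s32 input at 24 bits. The helper and the decoder table are illustrative, not autosub's code.

```python
# Sketch: decode a lossy source with a float decoder, then store the
# result as a high-bit-depth FLAC so the 32-bit float decode is not
# truncated to 16 bits. Only MP3 is mapped here as an example.
FLOAT_DECODERS = {".mp3": "mp3float"}  # AAC/Opus decoders are float already

def decode_to_flac(src: str, dst: str) -> list[str]:
    """Return an ffmpeg argv decoding src to a >16-bit FLAC."""
    ext = src[src.rfind("."):].lower()
    cmd = ["ffmpeg"]
    if ext in FLOAT_DECODERS:
        # Decoder selection must come before -i to affect the input.
        cmd += ["-c:a", FLOAT_DECODERS[ext]]
    cmd += ["-i", src, "-sample_fmt", "s32", "-c:a", "flac", dst]
    return cmd

print(decode_to_flac("speech.mp3", "speech.flac"))
```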
Using a flowchart may be too complicated, so let me just explain it. However, asking for higher audio quality is not really necessary, because the API itself may not need higher-quality audio clips. If you don't know much about the speech-to-text API used by this software, see #111. That said, what you describe could indeed influence the audio quality; I hadn't realized it before. Feel free to refactor the code into a better audio-processing workflow and then open a pull request.
I fixed this problem (partially) in my repo. Conversion is now separated: .wav (48 kHz/16-bit/mono) for finding regions and .flac (44.1 kHz/24-bit/mono) for the speech API. Details are in CHANGELOG.md. @daT4v1s
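The two-output split described above (one file for region detection, one for the speech API) can be sketched like this. The rates and formats follow the comment; the function name and output filenames are made up for illustration and are not autosub's actual API.

```python
# Sketch of the two-output conversion: a 48 kHz/16-bit mono WAV for
# region detection and a 44.1 kHz/24-bit mono FLAC (s32 sample format,
# stored as 24-bit by ffmpeg's FLAC encoder) for the speech-to-text API.
def conversion_cmds(src: str) -> dict[str, list[str]]:
    """Return one ffmpeg argv per output role."""
    wav_cmd = ["ffmpeg", "-i", src, "-ac", "1", "-ar", "48000",
               "-sample_fmt", "s16", "regions.wav"]
    flac_cmd = ["ffmpeg", "-i", src, "-ac", "1", "-ar", "44100",
                "-sample_fmt", "s32", "-c:a", "flac", "api.flac"]
    return {"regions": wav_cmd, "api": flac_cmd}

for name, cmd in conversion_cmds("input.mkv").items():
    print(name, " ".join(cmd))
```

Keeping the two conversions separate lets the region finder use a format tuned for energy detection while the API upload keeps the extra dynamic range.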
I just committed a feature that pre-processes audio using this workflow, controlled by autosub itself (issue #40).
The default pre-processing commands need ffmpeg-normalize. Of course, you can write your own commands using the -apc
input option, but remember to set the pre-processing output format to 44.1 kHz/24-bit/mono FLAC. Currently there is no logic to check the output format; it is used directly by the speech-to-text method, and since that method cuts the clips with the copy argument, it is very risky if your format isn't right.
My repo: you can install it from pip, or wait for me to release it. I've written quite a few features now; I think I will release it in a few more days.
I've already released the standalone version. Click here to download.
Also, if you are not satisfied with the current conversion command, you can replace it manually using -acc
/--audio-conversion-cmd
.
Apart from that, you can also do the conversion outside autosub: manually input -ap n
to override the conversion.
More info in my repo's readme.