Fixes wrong output dimensions in ConvTranspose1d

Open ecobost opened this issue 8 months ago • 0 comments

Solves #42, #58 and #68, all of which stem from an incorrect output-shape computation in the ConvTranspose1d of the DecoderBlock (as also pointed out in PR #44).

When a conv operation uses stride > 1, several different input lengths map to the same output length, so the output length of the inverse operation is underdetermined. ConvTranspose1d needs extra information (the output_padding) to produce the expected output length (see note in docs).
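A minimal sketch of that ambiguity, using only the shape formulas from the PyTorch docs (dilation 1). The helper names are hypothetical, and the configuration `kernel_size = 2 * stride`, `padding = ceil(stride / 2)` is an assumption about the blocks discussed here.

```python
import math


def conv1d_out_len(length, kernel_size, stride, padding):
    """Conv1d output length for dilation=1 (PyTorch shape formula)."""
    return (length + 2 * padding - kernel_size) // stride + 1


def conv_transpose1d_out_len(length, kernel_size, stride, padding, output_padding=0):
    """ConvTranspose1d output length for dilation=1 (PyTorch shape formula)."""
    return (length - 1) * stride - 2 * padding + kernel_size + output_padding


stride = 4
# Assumed layer configuration, not read from the repository at runtime.
kernel_size, padding = 2 * stride, math.ceil(stride / 2)

# Four different input lengths collapse to one downsampled length.
down = {n: conv1d_out_len(n, kernel_size, stride, padding) for n in range(16, 20)}
print(down)

# The transposed conv can only return one of them unless output_padding is set.
print(conv_transpose1d_out_len(4, kernel_size, stride, padding))
print(conv_transpose1d_out_len(4, kernel_size, stride, padding, output_padding=1))
```

Without output_padding the transposed conv always picks the smallest candidate length, which is why the shape only matches for some inputs.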

Given the construction constraints of the conv/deconv operations (namely, kernel_size=2*stride, padding=ceil(stride/2)), I worked out that the output_padding that always recovers the original input dimensions is:

If stride is even:
	output_padding = 0 if input_timesteps is divisible by stride, else 1
If stride is odd:
	output_padding = 0 if input_timesteps + 1 is divisible by stride, else 1

where input_timesteps is the timestep dimension of the input to the original Conv1d.
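The rule can be checked with plain shape arithmetic. The sketch below assumes `kernel_size = 2 * stride`, `padding = ceil(stride / 2)` and dilation 1; `required_output_padding` and `round_trip_len` are hypothetical helpers, not part of the codebase. Working through the formulas suggests the exact value is a remainder, which matches the 0-or-1 rule above whenever that remainder is at most 1; larger remainders would seem to need a larger output_padding (PyTorch requires it to stay below the stride, which a remainder always does).

```python
import math


def required_output_padding(input_timesteps, stride):
    """Hypothetical helper: output_padding that restores input_timesteps.

    Assumes kernel_size = 2 * stride, padding = ceil(stride / 2), dilation = 1.
    """
    if stride % 2 == 0:
        return input_timesteps % stride
    return (input_timesteps + 1) % stride


def round_trip_len(input_timesteps, stride, output_padding):
    """Length after Conv1d followed by ConvTranspose1d with the same settings."""
    kernel_size, padding = 2 * stride, math.ceil(stride / 2)
    down = (input_timesteps + 2 * padding - kernel_size) // stride + 1
    return (down - 1) * stride - 2 * padding + kernel_size + output_padding


# Every length is recovered when the remainder is used as output_padding.
for stride in (2, 3, 4, 5, 8):
    for n in range(stride, 65):
        op = required_output_padding(n, stride)
        assert round_trip_len(n, stride, op) == n
```

Since input_timesteps is only known at call time, a fixed output_padding in the module constructor can match only some input lengths.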

This PR sets output_padding=0 for even strides and output_padding=1 for odd strides. This works in the vast majority of cases (including all pretrained models), with two exceptions: (1) the stride is even and input_timesteps is not divisible by stride; (2) the stride is odd and input_timesteps + 1 is divisible by stride. Both are unlikely (and case 1 would fail anyway, even without this PR). At the very least, I believe the new setting is a more sensible default.
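A small sketch of how these defaults behave, again using only shape arithmetic and assuming `kernel_size = 2 * stride`, `padding = ceil(stride / 2)` and dilation 1. The function names are hypothetical, and `default_output_padding` simply restates the choice described above rather than quoting the diff.

```python
import math


def default_output_padding(stride):
    """Default described in this PR: 0 for even strides, 1 for odd strides."""
    return stride % 2


def recovers_length(input_timesteps, stride):
    """True if Conv1d followed by ConvTranspose1d returns the original length."""
    kernel_size, padding = 2 * stride, math.ceil(stride / 2)
    down = (input_timesteps + 2 * padding - kernel_size) // stride + 1
    up = (down - 1) * stride - 2 * padding + kernel_size + default_output_padding(stride)
    return up == input_timesteps


# List which input lengths round-trip for a few strides.
for stride in (2, 3, 4, 5, 8):
    ok = [n for n in range(stride, 4 * stride + 1) if recovers_length(n, stride)]
    print(stride, ok)
```

Under these assumptions, input lengths that are multiples of the stride round-trip for both even and odd strides, which is the usual situation when audio is padded to a multiple of the hop length.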

ecobost · Jun 28 '24 12:06