
How to train a model that can fully extract the 44100 Hz frequency range

Open dingjibang opened this issue 2 years ago • 8 comments

I want to train a 2 stems model

I noticed that in the yaml configuration of each model there are some parameters that affect the final frequency cutoff. It seems that multigpu_drums.yaml can handle the full 44100 Hz range, but with the reduction of num_blocks (11 => 9), the model size also decreases accordingly (29 MB => 21 MB).

So although something like multigpu_drums.yaml can handle 44100 Hz in full, the model shrinks instead. Does this affect the final accuracy?

It seems that the parameters dim_t, hop_length, overlap, and num_blocks have a wonderful complementarity that I cannot understand. Maybe this 'complementarity' was designed for the competition (mix to demucs), but I want to apply this to the real world without Demucs (only MDX-Net; after some testing, I think the potential of MDX-Net is higher than that of Demucs).

When I try to change num_blocks from 9 to 11, the inference results have overlapping and broken voices... Do you have any good parameter recommendations for training a full 44100 Hz model without loss of accuracy (i.e. without the model shrinking)?

dingjibang avatar Mar 28 '22 05:03 dingjibang

Disclaimer: I'm not part of the original team; my collaborator role here is to update some documentation.

I don't fully understand what you mean, but I think what you're trying to achieve here is to train models that do not have a frequency cutoff?

If so, maybe take a look at their presentation slides, which mention that:

And try changing both num_blocks and dim_f (2^num_blocks = dim_f?) and the related parameters?

Zokhoi avatar Mar 29 '22 08:03 Zokhoi

Currently I am using the configuration below, which trains without a frequency cutoff:

```yaml
num_blocks: 9
l: 3
g: 32
k: 3
bn: 8
bias: False

n_fft: 4096
dim_f: 2048
dim_t: 128
dim_c: 4
hop_length: 1024

overlap: 2048
```

Although it works and there is no frequency cutoff, the generated onnx/ckpt files are smaller than the pretrained vocals/bass/others files.

File sizes:

- my onnx without freq cutoff: 21.417 MB
- pretrained onnx with freq cutoff (vocals/bass/others): 29.008 MB

So what I want to ask is:

  1. Does the reduction in model file size mean that the information it contains is also reduced, thus affecting the quality of the model?
  2. If 1 is true, how can I make the model file larger without a frequency cutoff?

dingjibang avatar Mar 29 '22 09:03 dingjibang

For 1, from my understanding, reducing num_blocks decreases the number of intermediate blocks/layers the model uses to recognize patterns, so with fewer layers there is less information in the model. See the paper on TFC-TDF-U-Net v1 and this brief explanation of what the intermediate blocks do.
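As a rough illustration of why fewer blocks means a smaller file, here is a toy parameter count for a stack of 3x3 conv layers (my own sketch with illustrative numbers, not the actual TFC-TDF block structure):

```python
# Toy illustration (NOT the actual TFC-TDF architecture): each 3x3 conv
# block with 32 in/out channels contributes a fixed number of parameters,
# so total parameter count -- and hence saved file size -- scales with
# the number of stacked blocks.

def conv2d_params(in_ch: int, out_ch: int, k: int = 3) -> int:
    # weights (in_ch * out_ch * k * k) plus one bias per output channel
    return in_ch * out_ch * k * k + out_ch

def stack_params(num_blocks: int, ch: int = 32) -> int:
    return num_blocks * conv2d_params(ch, ch)

print(stack_params(9))   # 83232 parameters
print(stack_params(11))  # 101728 parameters -> larger saved model
```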

For 2, from the paper on MDX-Net:

... high frequencies above the target source’s expected frequency range were cut off from the mixture spectrogram. This way, we can increase n_fft while using the same input spectrogram size (which we needed to constrain for the separation time limit), and using a larger n_fft usually leads to better SDR. It is also why we did not use a multi-target model (a single model that is trained to estimate all four sources), where we could not use source-specific frequency cutting.

So for having no frequency cutoff, you would probably want n_fft and dim_f to be the same. If you were to increase the model size, you could probably increase dim_f and num_blocks. Also, here's a brief explanation of the frequency cutoff from the author.
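As a back-of-the-envelope check (my own sketch, assuming a 44100 Hz sample rate; the 6144-point example is hypothetical), the cutoff implied by keeping only dim_f of the roughly n_fft / 2 usable STFT bins can be estimated from the bin spacing:

```python
# Sketch: estimate the frequency cutoff implied by keeping dim_f of the
# roughly n_fft / 2 usable STFT bins, assuming 44100 Hz audio.

def cutoff_hz(n_fft: int, dim_f: int, sample_rate: int = 44100) -> float:
    bin_hz = sample_rate / n_fft      # frequency width of one STFT bin
    return dim_f * bin_hz             # highest frequency retained

print(cutoff_hz(4096, 2048))  # 22050.0 -> full band, no cutoff
print(cutoff_hz(6144, 2048))  # 14700.0 -> hypothetical larger n_fft, audible cutoff
```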

Zokhoi avatar Mar 29 '22 11:03 Zokhoi

Thanks for your reply (double thanks^_^)

Changing n_fft and dim_f to the same value causes an error:

```
RuntimeError: Error instantiating 'src.models.mdxnet.ConvTDFNet' : Trying to create tensor with negative dimension -2047: [1, 4, -2047, 256]
```

Error stack & source code: src/models/mdxnet.py#L33

```python
self.freq_pad = nn.Parameter(torch.zeros([1, dim_c, self.n_bins - self.dim_f, self.dim_t]), requires_grad=False)
```

It seems that n_fft and dim_f cannot be the same in the code; dim_f must be no larger than n_bins (n_fft / 2 + 1) to work properly.

Sorry, I'm a layman in this field and don't know much about these complex things... I just want to get a correct config to train 😭😭
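A minimal sketch of the constraint behind that error, based on the line quoted above: the pad size `n_bins - dim_f` must be non-negative, and `n_bins` is `n_fft // 2 + 1`:

```python
# Minimal reproduction of the shape constraint from mdxnet.py:
# the model pads the kept dim_f bins up to n_bins, so n_bins - dim_f
# must be >= 0, where n_bins = n_fft // 2 + 1.

def freq_pad_size(n_fft: int, dim_f: int) -> int:
    n_bins = n_fft // 2 + 1
    return n_bins - dim_f

print(freq_pad_size(4096, 2048))  # 1 -> valid
print(freq_pad_size(4096, 4096))  # -2047 -> the negative dimension in the error
```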

dingjibang avatar Mar 29 '22 13:03 dingjibang

Hi @dingjibang, can you share the inference code you used that produced this:

When I try to change num_blocks from 9 to 11, the results of inference have overlapping and broken voices..

along with an audio sample?

Thank you @Zokhoi for contributions by the way.

ws-choi avatar Mar 29 '22 13:03 ws-choi

```python
Conv_TDF_net_trim(
    device=device, load=load,
    model_name='Conv-TDF', target_name='guitar',
    lr=0.0002, epoch=470,
    L=9, l=3, g=32, bn=8, bias=False,  # <== when I change num_blocks from 9 to 11, this "L" value also changes
    dim_f=11, dim_t=7
)
```

and `n_fft_scale['guitar'] = 2`

I found that the overlapping and broken sound were caused by too little training time; I was too impatient... After quickly training both for 10 epochs, the problems above were gone.

So things seem to end very simply 😭. The other parameters of the configuration above remain unchanged; just increasing num_blocks seems to increase the size of the final model.

Sorry for an extra question: does n_fft also affect the final quality? (I don't consider the time cost of training.) If so, in the configuration above, how can I safely increase this value? Do I need to change other associated parameter values?

Thank you

dingjibang avatar Mar 29 '22 15:03 dingjibang

@ws-choi What is the importance of this line

```python
self.n_bins = n_fft // 2 + 1
```

for n_bins to be half of n_fft (plus one)? Is it because of the sampling theorem?

@dingjibang I think that since the harmonic series of instruments like bass is squashed into one frequency region instead of spread across the spectrum, having a larger n_fft with a fixed dim_f would divide just the lower frequencies into more bins, making it clearer for the model to find the patterns that correspond to those compressed bass harmonics. That is probably why "using a larger n_fft usually leads to better SDR" when the spectrogram size is fixed.

Different instruments occupy different regions of the spectrum, so there are different upper limits for each instrument's frequency cutoff, and when scaled to the same dim_f, the n_fft for different instruments would differ. From what I can read from the code, if you change n_fft you would probably also need to change dim_f to retain the ratio between them (n_fft : dim_f = 2 : 1 for no cutoff?).
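On the `n_bins = n_fft // 2 + 1` question: yes, it follows from the spectrum of a real-valued signal being conjugate-symmetric, so only the non-negative frequency bins carry information. A quick check with NumPy:

```python
import numpy as np

# The FFT of a real-valued frame is conjugate-symmetric, so the upper
# half of the spectrum is redundant; rfft returns only the informative
# n // 2 + 1 bins, which is exactly where n_bins = n_fft // 2 + 1 comes from.
n_fft = 4096
frame = np.random.randn(n_fft)
spectrum = np.fft.rfft(frame)
print(spectrum.shape)  # (2049,) == (n_fft // 2 + 1,)
```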

Zokhoi avatar Mar 29 '22 16:03 Zokhoi