tensorflow-wavenet

Wavenet parameters

Open Garygunn94 opened this issue 8 years ago • 4 comments

Can someone explain to me the meaning behind the following parameters? "residual_channels": 32, "dilation_channels": 32, "skip_channels": 512

I could not find a definition for what these parameters represent in the readme and so would be grateful if someone could provide a quick summary of them.

Thanks

Garygunn94 avatar May 04 '17 14:05 Garygunn94

They represent the dimensionality of several "feature vectors" along the model (the tensors have batch, frames, and channels as dimensions; I'm not sure in which order). IIRC, if you look at the original paper (https://arxiv.org/pdf/1609.03499.pdf), Section 2.4:

  • residual_channels is the number of "channels", or "filters", of the residual output (which is the number of output channels of the top left 1x1 convolution and of the initial causal convolution).
  • dilation_channels is the number of output channels of the "Dilated conv" block.
  • skip_channels is the number of output channels of the first 1x1 convolution on the right.
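To tie the three settings to the values in the question, here is a rough sketch of the channel counts at each stage of one residual block. The helper `block_shapes` is hypothetical (it is not part of tensorflow-wavenet), the (batch, frames, channels) ordering is an assumption, and any trimming of the time axis by the convolutions is ignored.

```python
def block_shapes(batch, frames, residual_channels, dilation_channels, skip_channels):
    """Channel counts at each stage of one residual block (hypothetical helper)."""
    return {
        # What the block receives; the initial causal conv also emits this width.
        "block_input": (batch, frames, residual_channels),
        # Output of the dilated convolution (filter and gate branches).
        "dilated_conv": (batch, frames, dilation_channels),
        # 1x1 conv back to the residual width so it can be added to the input.
        "residual_output": (batch, frames, residual_channels),
        # 1x1 conv feeding the skip connections that are summed across layers.
        "skip_output": (batch, frames, skip_channels),
    }


shapes = block_shapes(batch=1, frames=1000, residual_channels=32,
                      dilation_channels=32, skip_channels=512)
for name, shape in shapes.items():
    print(f"{name:16s} {shape}")
```

The residual output has to match the block input in channel count because the two are added together, which is why a single residual_channels value controls both.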

You can think of 1x1 convolutions as applying a linear transformation + activation independently to each frame; I prefer to call them time-distributed (or space-distributed) layers. They are called "1x1 convolutions" because that's the easiest way to represent them in formulas and, if convolutions are already available, a very easy way to implement them.
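A minimal pure-Python sketch of that equivalence, with the activation left out for brevity. The functions `conv1d` and `dense` are illustrative names, not library calls: a generic 1-D convolution with a kernel of width 1 gives the same result as applying one weight matrix to every frame separately.

```python
def conv1d(frames, kernel):
    """Valid 1-D convolution (cross-correlation form, no bias).

    frames: list of T frames, each a list of C_in floats.
    kernel: list of K taps, each a C_in x C_out matrix (list of lists).
    """
    width = len(kernel)
    c_out = len(kernel[0][0])
    out = []
    for t in range(len(frames) - width + 1):
        acc = [0.0] * c_out
        for k in range(width):
            for i, x in enumerate(frames[t + k]):
                for o in range(c_out):
                    acc[o] += x * kernel[k][i][o]
        out.append(acc)
    return out


def dense(frame, weights):
    """Linear layer applied to a single frame (no bias, no activation)."""
    c_out = len(weights[0])
    return [sum(x * weights[i][o] for i, x in enumerate(frame)) for o in range(c_out)]


frames = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]   # T=3 frames, C_in=2
weights = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]    # maps C_in=2 to C_out=3

conv_out = conv1d(frames, [weights])              # kernel of width 1
dense_out = [dense(f, weights) for f in frames]   # same weights, frame by frame
assert conv_out == dense_out
```

With a kernel wider than 1, each output frame would also depend on neighbouring frames, which is exactly what the dilated convolutions add.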

lemonzi avatar May 04 '17 15:05 lemonzi

@lemonzi

If what I am interested in is generated (weird) sound works based off of historical/contemporary experimental music recordings, I gather that the most important parameter is probably getting the receptive field size up - by extending the dilations out to e.g. 4096, having 6-8 rows of them etc (fwiw, when I tried using a larger filter_width to increase the receptive field size, I got an error saying it cannot generate files when filter_width > 2).

Obviously the other really important parameter is just taking the time to train it on audio with e.g. a 44.1kHz sample rate, a large enough library, etc.

My question here, which I haven't seen discussed in some of the other threads on audio processing/music training, is how does adjusting the "channel" parameters listed above up/down affect e.g. time/step, time to convergence, musical output etc.; also, must these always be powers of 2?

I understand basic ANN backprop algorithms thoroughly, but I don't really have an actual grasp of the causal convolution network's algorithm, so understanding the "channels" parameter might be beyond my understanding, but still thought I would ask.

delta-6400 avatar May 07 '17 03:05 delta-6400

The fast generation algorithm currently relies on the filter width being 2 (as in the original paper), which is why there's an error. The way to extend the receptive field is indeed to extend the dilations and to stack more layers. The sample rate also has a direct effect on the receptive field: the same number of samples covers less time at a higher rate, so it's better to stick to 16 or 22.05kHz until you have it all working. High-frequency content is more difficult to model anyway, as there is more noise.
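As a rough sketch of how dilations and sample rate interact, the receptive field of a stack of dilated causal convolutions can be estimated as below. The function `receptive_field` is a hypothetical helper, and it assumes one extra initial causal convolution of width 2 in front of the stack; the dilation schedule is only an example.

```python
def receptive_field(dilations, filter_width=2, initial_filter_width=2):
    """Receptive field, in samples, of stacked dilated causal convolutions.

    Assumes every dilated layer uses the same filter width and that a single
    initial causal convolution precedes the stack.
    """
    field = (filter_width - 1) * sum(dilations) + 1
    field += initial_filter_width - 1
    return field


# Example schedule: dilations 1, 2, 4, ..., 512, repeated 5 times.
dilations = [2 ** i for i in range(10)] * 5
samples = receptive_field(dilations)

for rate in (16000, 22050, 44100):
    print(f"{rate} Hz: {samples} samples = {1000 * samples / rate:.1f} ms")
```

The sample count is fixed by the architecture, so moving from 16kHz to 44.1kHz shrinks the time span covered by the same network by almost a factor of three.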

Regarding the channel parameter: it's equivalent to the number of neurons in a layer for a "traditional" neural net. More channels mean more operations (longer training time) and more generalisation capability (so, the ability to model more complex sounds), but they also require more data and more training iterations (in addition to each iteration running slower). They are usually powers of 2 because that makes it very easy to try different exponentially-spaced values to see the difference, and (to a lesser degree) because it gives a nice memory layout. For instance, on the CPU, operations run in blocks of 4, so it's convenient for sizes to be multiples of 4. See https://www.quora.com/What-do-channels-refer-to-in-a-convolutional-neural-network.
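To get a feel for how cost grows with the channel settings, here is a back-of-the-envelope weight count for one residual layer. The helper `layer_weights` is hypothetical; it assumes a gated dilated convolution (filter and gate branches) followed by two 1x1 convolutions, and it ignores biases.

```python
def layer_weights(residual_channels, dilation_channels, skip_channels, filter_width=2):
    """Approximate number of weights in one residual layer (biases ignored)."""
    # Filter and gate branches of the dilated convolution.
    dilated = 2 * filter_width * residual_channels * dilation_channels
    # 1x1 convolution back to the residual width.
    to_residual = dilation_channels * residual_channels
    # 1x1 convolution to the skip width.
    to_skip = dilation_channels * skip_channels
    return dilated + to_residual + to_skip


base = layer_weights(32, 32, 512)
for scale in (1, 2, 4):
    n = layer_weights(32 * scale, 32 * scale, 512 * scale)
    print(f"x{scale} channels: {n} weights per layer ({n / base:.0f}x the base)")
```

Each weight corresponds to roughly one multiply-add per output sample, so doubling all three settings roughly quadruples both the per-layer weights and the per-step compute.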

lemonzi avatar May 22 '17 22:05 lemonzi

Any idea what hyper-parameter values were used in the original paper? I'm looking for the number of blocks and layers, and also the residual, skip and dilation channels.

Vichoko avatar Nov 07 '19 04:11 Vichoko