
Why is normalization of condition images and target images different?


When creating my own dataset based on the dataset tutorial, I noticed that the source images and target images are normalized to different ranges:

    # Normalize source images to [0, 1].
    source = source.astype(np.float32) / 255.0

    # Normalize target images to [-1, 1].
    target = (target.astype(np.float32) / 127.5) - 1.0
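
For reference, checking both mappings on sample values (a quick sketch assuming numpy imported as np, as in the tutorial snippet above):

    import numpy as np

    # Hypothetical 8-bit pixel values spanning the full range.
    pixels = np.array([0, 128, 255], dtype=np.uint8)

    source = pixels.astype(np.float32) / 255.0        # values land in [0, 1]
    target = pixels.astype(np.float32) / 127.5 - 1.0  # values land in [-1, 1]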

I couldn't find any information on why these are normalized differently, either in the paper or on GitHub. Is this difference important?

saltacc avatar Feb 19 '23 10:02 saltacc

Oh, does this have to do with ReLU not being able to output negative values?

saltacc avatar Feb 19 '23 11:02 saltacc

It looks like this was deliberately changed to be this way recently: https://github.com/lllyasviel/ControlNet/commit/330064db78c54b6a9710e883f1c885ffadc3f7c7

saltacc avatar Feb 19 '23 11:02 saltacc


The new conditioning information (the hint: edge maps, normal maps, etc.) is encoded using a small convnet (the E network in the paper) that is trained along with the rest of ControlNet. Since this is a separate, newly trained network, it doesn't really matter whether its inputs are normalized to [0, 1], [-1, +1], or even [0, 255]; it should still work. (The target image, by contrast, goes through the pretrained SD VAE, which expects inputs in [-1, 1], so its normalization is fixed.)

Notice that this network (code here: https://github.com/lllyasviel/ControlNet/blob/main/cldm/cldm.py#L146-L162) has three 2D conv layers with stride=2. Each of these conv layers halves the spatial dimensions of the input feature map, so the block as a whole downsamples the input by a factor of 8 (2**3). This is consistent with the reduction factor of the VAE in SD: 512 (image size) / 64 (latent feature map size) = 8.
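
A minimal PyTorch sketch of that downsampling path (the channel widths here are illustrative, not the exact ones in cldm.py, and the real block interleaves additional non-strided convs):

    import torch
    import torch.nn as nn

    # Three stride-2 convs halve the spatial size three times: 512 -> 256 -> 128 -> 64.
    hint_encoder = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),   # 512 -> 256
        nn.SiLU(),
        nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 256 -> 128
        nn.SiLU(),
        nn.Conv2d(32, 96, kernel_size=3, stride=2, padding=1),  # 128 -> 64
        nn.SiLU(),
    )

    hint = torch.rand(1, 3, 512, 512)  # conditioning image, values in [0, 1]
    print(hint_encoder(hint).shape)    # torch.Size([1, 96, 64, 64])

So the hint is brought down to the same 64x64 resolution as the SD latents before being combined with them.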

xiankgx avatar Feb 20 '23 00:02 xiankgx

But then why did the author modify the code if both ranges work?


iLori-Jiang avatar Jun 13 '24 15:06 iLori-Jiang