Why is normalization of condition images and target images different?
When creating my own dataset based on the dataset tutorial, I noticed that the source images and target images are normalized to different ranges:
# Normalize source images to [0, 1].
source = source.astype(np.float32) / 255.0
# Normalize target images to [-1, 1].
target = (target.astype(np.float32) / 127.5) - 1.0
I couldn't find anything in the paper or on GitHub explaining why these are normalized differently. Is this difference important?
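Just to spell out the relationship (a minimal numpy check I wrote, not code from the repo), the two ranges are related by a simple affine map:

```python
import numpy as np

# All possible 8-bit pixel values.
pixels = np.arange(256, dtype=np.float32)

source = pixels / 255.0          # conditioning image  -> [0.0, 1.0]
target = pixels / 127.5 - 1.0    # target image        -> [-1.0, 1.0]

# The target range is just the source range rescaled: target = 2 * source - 1.
assert np.allclose(target, 2.0 * source - 1.0)
```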
Oh, does this have to do with ReLU not being able to output negative values?
Looking at https://github.com/lllyasviel/ControlNet/commit/330064db78c54b6a9710e883f1c885ffadc3f7c7, it seems this was purposefully changed to work this way recently.

The new conditioning information (the hint: edge maps, normal maps, etc.) is encoded by a small convolutional encoder that is trained jointly with the ControlNet. Since this is a separate network with its own learned weights, it doesn't really matter whether its inputs are normalized to [0, 1], [-1, +1], or even [0, 255]; it should still work.
This network (code here: https://github.com/lllyasviel/ControlNet/blob/main/cldm/cldm.py#L146-L162)
has three 2D conv layers with stride=2. Each of these halves the spatial dimensions of its input feature map, so the network as a whole downsamples the hint by a factor of 8 (= 2^3). This matches the reduction factor of the VAE in SD (512 px image size / 64 latent feature map size = 8).
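As a quick sanity check on that downsampling, here is a simplified sketch (not the exact layer stack or channel widths from cldm.py, those are placeholders):

```python
import torch
import torch.nn as nn

# Simplified stand-in for the hint encoder: three stride-2 convolutions,
# each halving the spatial resolution, so 512 -> 256 -> 128 -> 64 overall.
hint_encoder = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
    nn.SiLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
    nn.SiLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
)

hint = torch.rand(1, 3, 512, 512)   # conditioning image in [0, 1]
print(hint_encoder(hint).shape)     # torch.Size([1, 64, 64, 64]) -> the 64x64 latent grid
```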
But then why did the author change the code if both ranges work?