Marigold
Clipping is Removing Valuable Depth Estimation Values, Resulting in Squished Depth Maps
Hello everybody,
I have come across this issue while experimenting with the VAE depth decoder `decode_depth` and the single-inference function `single_infer`. The VAE decoder is not bound to the range [-1, 1]: in many instances, for a given image (resized to the Stable Diffusion v2 native resolution), the decoded latent yields min-max values of around [-1.5, 1.4]. These ranges vary with the image contents, the aspect ratio, and, in the case of inference, the initial isotropic noise.
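As a quick way to reproduce this observation (a minimal sketch; `pipe` and `depth_latent` are placeholders for a loaded Marigold pipeline and the denoised latent produced inside `single_infer`):

```python
import torch

# Decode the denoised depth latent and inspect its raw range
# before any clipping is applied.
with torch.no_grad():
    depth = pipe.decode_depth(depth_latent)

print(depth.min().item(), depth.max().item())
# On many images this prints values outside [-1, 1], e.g. roughly -1.5 and 1.4.
```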
At the end of the inference function `single_infer`, the decoded generated depth map is simply clipped to [-1, 1]. This removes valuable depth information from the generated value distribution, assigning the depth value 0 (or 1, respectively) to all values outside [-1, 1]. Intuitively, clipping results in a squished depth map. Instead, to retain the complete generated depth value distribution, it is better to replace the clip-and-shift operations with min-max normalization to [0, 1]:

```python
min_depth = torch.min(depth)
max_depth = torch.max(depth)
depth = (depth - min_depth) / (max_depth - min_depth)
depth = torch.clamp(depth, 0, 1)
```
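For a toy illustration of the difference (made-up values, not from the model): clip-then-shift collapses everything beyond [-1, 1] onto the extremes, whereas min-max normalization preserves the relative ordering of those values:

```python
import torch

depth = torch.tensor([-1.5, -1.0, 0.0, 1.0, 1.4])

# Current behavior: clip to [-1, 1], then shift/scale to [0, 1].
clipped = (torch.clip(depth, -1.0, 1.0) + 1.0) / 2.0
print(clipped)      # tensor([0.0000, 0.0000, 0.5000, 1.0000, 1.0000])

# Proposed behavior: min-max normalization keeps the full distribution.
normalized = (depth - depth.min()) / (depth.max() - depth.min())
print(normalized)   # tensor([0.0000, 0.1724, 0.5172, 0.8621, 1.0000])
```

Note how -1.5 and -1.0 become indistinguishable after clipping.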
This squishing also affects the final aggregated depth map, since some generated depth maps have decoded ranges within [-1, 1] and retain their extreme depth values, while others exceed it and have their extremes collapsed. Usually, min-max normalization is not a fix in these kinds of situations. Here, however, since the task is monocular depth estimation, the closest and farthest points must be associated with the values 0 and 1, respectively.
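To make the ensemble inconsistency concrete (illustrative values only): two members predicting the same relative depth, one decoded within [-1, 1] and one slightly beyond it, no longer agree after clip-and-shift:

```python
import torch

a = torch.tensor([-0.9, 0.0, 0.9])   # decoded range stays inside [-1, 1]
b = torch.tensor([-1.2, 0.0, 1.2])   # decoded range exceeds [-1, 1]

clip_shift = lambda d: (torch.clip(d, -1.0, 1.0) + 1.0) / 2.0
print(clip_shift(a))  # tensor([0.0500, 0.5000, 0.9500])
print(clip_shift(b))  # tensor([0.0000, 0.5000, 1.0000])  <- extremes collapsed
```

The aggregated map therefore inherits a mix of squished and unsquished extremes.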
Please let me know if I am missing something. Best.
I've noticed this as well - curious as to why the depth was clipped too!
During the fine-tuning process, Stable Diffusion quickly adapts its latents so that they fall within [-1, 1] after decoding. One can plot a histogram of the decoded generated depth values and see that the overwhelming majority of the distribution is bounded within [-1, 1]; the few depth values outside this range can be considered outliers. If they are not clipped, extreme outliers may squish the objects within the [-1, 1] range to accommodate them. I presume that with more training time, the number of outliers will converge to 0.
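A minimal sketch of that histogram check, assuming `depth` holds the raw output of `decode_depth` before clipping:

```python
import matplotlib.pyplot as plt

values = depth.flatten().cpu().numpy()
outlier_frac = ((values < -1) | (values > 1)).mean()

plt.hist(values, bins=200)
plt.axvline(-1, color="red")
plt.axvline(1, color="red")
plt.title(f"{outlier_frac:.2%} of decoded depth values outside [-1, 1]")
plt.show()
```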