Can you explain the steps you use to normalize the input depth image?
Looking at your code, I don't understand steps such as dividing by the depth scale or selecting only pixels greater than the min depth.
We do not normalize the depth; we directly predict metric values, also at training time. This means that no sigmoid trick is used to squeeze the depth prediction into [0, 1], so our depth is not rescaled by the dataset-dependent max_depth value.
The rescaling is applied only when loading depth from disk, since it is saved as uint16. Depth is stored as PNG, so to avoid large quantization error it is multiplied by a factor (e.g., 256 or 1000) before saving. Dividing by that same depth scale on load recovers the metric values.
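As a rough illustration of the load-time rescaling, here is a minimal sketch. The function name `decode_depth`, the sample values, and the choice of 256 as the scale are assumptions for the example, not the repository's actual code; the array stands in for a PNG that has already been read into memory.

```python
import numpy as np

def decode_depth(raw_uint16: np.ndarray, depth_scale: float) -> np.ndarray:
    """Convert uint16 depth (as stored in a PNG) back to float32 metric depth."""
    # Cast first so the division is done in floating point, not on integers.
    return raw_uint16.astype(np.float32) / depth_scale

# Hypothetical 2x2 depth map stored with a scale factor of 256.
raw = np.array([[0, 256], [2560, 20480]], dtype=np.uint16)
depth_m = decode_depth(raw, depth_scale=256.0)
```

With a scale of 256 the stored resolution is 1/256 of a unit, which is why the multiplication before saving keeps the quantization error small.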
Min depth is just a dummy threshold (0.01; GT is always > 0.01 where it exists) used to mask out invalid depth points, which are usually stored as 0.0.
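The masking step could look roughly like the sketch below. The names `valid_mask`, `gt`, and `pred` and the sample values are made up for illustration, and the masked mean absolute error is only an example of using the mask, not necessarily the loss or metric used in the code.

```python
import numpy as np

MIN_DEPTH = 0.01  # dummy threshold; valid GT is always above it

def valid_mask(depth: np.ndarray, min_depth: float = MIN_DEPTH) -> np.ndarray:
    """Boolean mask that is True only where ground-truth depth exists."""
    return depth > min_depth

# Hypothetical ground truth with two missing pixels stored as 0.0.
gt = np.array([[0.0, 1.5], [0.0, 42.0]], dtype=np.float32)
pred = np.array([[3.0, 1.0], [7.0, 40.0]], dtype=np.float32)

mask = valid_mask(gt)
# Only valid pixels contribute; predictions at invalid pixels are ignored.
abs_err = np.abs(pred[mask] - gt[mask]).mean()
```

Here the two pixels with GT 0.0 are dropped, so the error is computed over the remaining two pixels only.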