
testing pretrained model - depth

Open emergencyd opened this issue 6 years ago • 8 comments

[image]

I'm using the pretrained model in folder "rgb2depth". I want to reproduce "loss = 0.35". Which data should I use?

I've tried the "depth_zbuffer" test data, but "l1_loss, loss_g, loss_d_real, loss_d_fake" are around "0.6,0.7,0.9,0.7".

I suppose I used the wrong data or the wrong loss... should I use the "depth_euclidean" data instead?

Thank you!

emergencyd avatar Jan 19 '19 05:01 emergencyd

@alexsax could you please comment on which models we used for testing?

b0ku1 avatar Jan 20 '19 06:01 b0ku1

Also, did you use a mask to mask out depth values that are "bad"? Basically we extracted the depth values from the mesh, and since there are holes in the mesh, some depth values in the ground truth are too high (we masked these values out during training).

b0ku1 avatar Jan 20 '19 21:01 b0ku1

@b0ku1 that's a good point. That might explain the relatively high losses. @emergencyd we reported the L1 loss, so that's the one you should pay attention to.

One contributing factor might be that the models that we released were trained on an internal set of images that were processed a bit differently than the released data. The internal set always has a FoV of 75 degrees, but the released data has a range of 45-75 degrees. The pretrained networks don't work as well on images with a narrow FoV, like those in the release set. You can verify this for yourself on the Taskonomy demo site.

@emergencyd do you notice that the losses are significantly better for large-fov images?

alexsax avatar Jan 20 '19 22:01 alexsax

@emergencyd changing to rgb-large won't fix the FoV (field of view) problem. That's a discrepancy between the internal and public datasets: we trained and tested on images with a fixed FoV (internal), but to make the public release more generally useful, the released dataset has varying FoV.

Re the mask: @alexsax, does the released dataset come with a mask?

b0ku1 avatar Jan 21 '19 04:01 b0ku1

Seconding @b0ku1 above.

And no need for an explicit mask—just check for pixels where the depth is equal to (or very close to) the max value, 2^16-1 :)
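For illustration, a minimal sketch of building such a mask from the raw uint16 depth image (the function name and the `tol` parameter are made up for the example; only the 2^16-1 sentinel comes from this thread):

```python
import numpy as np

MAX_UINT16 = 2**16 - 1  # sentinel value written into mesh holes

def valid_depth_mask(raw_depth, tol=0):
    """Return a 0/1 mask that is 0 wherever the raw uint16 depth is within
    `tol` of the sentinel max value, and 1 elsewhere."""
    raw_depth = np.asarray(raw_depth, dtype=np.int64)
    return (raw_depth < MAX_UINT16 - tol).astype(np.float32)

# Masked L1, so that hole pixels contribute nothing:
# l1 = np.sum(np.abs(pred - target) * mask) / np.sum(mask)
```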

Finally, depth Euclidean is the distance from each pixel to the optical center. Depth z-buffer is something else (see the sup mat for the full description!).
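For reference, under the usual pinhole conventions (z-buffer = distance along the optical axis, Euclidean = distance from the optical center) the two are related per pixel as sketched below. This is a generic conversion, not the sup mat's exact definition, and the intrinsics `fx, fy, cx, cy` are assumed to be known:

```python
import numpy as np

def zbuffer_to_euclidean(z, fx, fy, cx, cy):
    """Convert z-buffer depth to Euclidean depth for a pinhole camera.
    z is an HxW array; fx, fy, cx, cy are the camera intrinsics."""
    h, w = z.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # length of the ray through each pixel at unit depth along the optical axis
    ray_length = np.sqrt(1.0 + ((u - cx) / fx) ** 2 + ((v - cy) / fy) ** 2)
    return z * ray_length
```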

alexsax avatar Jan 21 '19 18:01 alexsax

I can now see the "full_plus", "full", "medium", and "small" split information for the whole dataset, but I can't find the FoV information for each image. Where should I get it?

Also, do I need to drop the pixels with extremely high values (if I understand correctly)?

emergencyd avatar Jan 22 '19 15:01 emergencyd

> I can now see the "full_plus", "full", "medium", and "small" split information for the whole dataset, but I can't find the FoV information for each image. Where should I get it?

The pose files :)

> Also, do I need to drop the pixels with extremely high values (if I understand correctly)?

Yes, exactly.
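A minimal sketch of pulling the FoV out of those files and filtering on it (this assumes the per-image pose/point_info files are JSON and expose the `field_of_view_rads` key used in the next comment; adjust the path pattern and threshold to your setup):

```python
import glob
import json

MIN_FOV_RADS = 1.3  # ~75 degrees

def wide_fov_views(pose_dir):
    """Return the pose files whose field of view is at least MIN_FOV_RADS.
    Assumes each pose file is JSON with a 'field_of_view_rads' entry."""
    keep = []
    for path in sorted(glob.glob(pose_dir + '/*.json')):
        with open(path) as f:
            info = json.load(f)
        if info.get('field_of_view_rads', 0.0) >= MIN_FOV_RADS:
            keep.append(path)
    return keep
```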

alexsax avatar Jan 22 '19 18:01 alexsax

1. According to the supplementary material, I use depth_zbuffer rather than depth_euclidean as my target depth.

2. Then I use the "field_of_view_rads" information to pick the images with an FoV larger than 1.3 rad.

3. Then I use the code below to process the target image and calculate the L1 loss:

```python
#####################
#### target data ####
#####################
# (np, load_raw_image_center_crop, load_ops, cfg, m, img, and
# training_runners come from the surrounding taskonomy test script)

# raw uint16 z-buffer depth; mask_filt is 0 at mesh-hole pixels
# (the sentinel max value) and 1 elsewhere
img_t = load_raw_image_center_crop(target_name, color=False)
mask_filt = np.where(img_t >= 2**16 - 1, 0, 1)

# zero out hole pixels, then set all zero pixels to the largest valid depth
img_t[img_t >= 2**16 - 1] = 0
img_t[img_t == 0] = np.max(img_t)

# apply the same target preprocessing as the config, then add a batch dim
img_t = cfg['target_preprocessing_fn'](img_t, **cfg['target_preprocessing_fn_kwargs'])
img_t = img_t[np.newaxis, :]

# resize the mask to the output resolution and add a batch dim
mask_filt = load_ops.resize_image(mask_filt, [256, 256, 1])
weight_mask = mask_filt[np.newaxis, :]

#####################
###### predict ######
#####################
predicted, representation, losses = training_runners['sess'].run(
    [m.decoder_output, m.encoder_output, m.losses],
    feed_dict={m.input_images: img,
               m.target_images: img_t,
               m.masks: weight_mask})
```

I noticed that there is a function "depth_single_image", so I tried it and calculated the loss again:

```python
# post-process the raw network output, then compute a masked L1 loss by hand
predicted = depth_single_image(predicted)
diff = np.abs(predicted - img_t)
diff[weight_mask == 0] = 0              # ignore mesh-hole pixels
l1_loss = np.sum(diff) / np.sum(weight_mask)
```

But the loss still doesn't seem right (around 0.15). This time, though, the generated prediction looks the same as the result on the demo website: [image]

I guess there is something wrong with my processing of target images, and I'm quite confused now.

@alexsax

emergencyd avatar Jan 25 '19 14:01 emergencyd