PIFu

Collection of questions

GTO2013 opened this issue 3 years ago • 6 comments

Hi, I am currently working on understanding PIFu in detail so I can apply it to cars. I already wrote you an email, if you remember. I have it running, and it's doing decently already for the short timeframe :)

Here are my questions:

  • The MLP surface classifier uses Conv1d operations; is there any reason why this was chosen over fully connected layers? Because that's what I thought MLP stands for.
  • The SDF values are binarized in your case, so even if a point is extremely close to the surface and the classifier predicts something around 0.5, the error will still be large (1 - 0.5). Why is it done that way? The DeepSDF paper, for example, regresses the actual distance with a clamping value.
  • In the multi-view case all stacks are processed in parallel and the mean is calculated somewhere in the middle of the classifier. Each point has a stack of feature vectors and the depth of the point we want to query. But how can the layers after the mean know where the query point is? Wouldn't it be much more precise to just give the classifier normalized X, Y, Z coordinates in world space?
  • In your paper you use this graphic: [image]

From this image it looks like the input image is directly added to the feature vector after the encoder (the line below the upper image encoder). But where is this done in the code? I looked for it but didn't find it. The input always goes through some filtering before the classifier sees it, is that correct?

Sorry for the number of questions; I hope you find some time to answer them.

Thank you!

GTO2013 avatar Nov 28 '20 15:11 GTO2013

The MLP surface classifier uses Conv1d operations; is there any reason why this was chosen over fully connected layers? Because that's what I thought MLP stands for.

It's just a matter of design choice. I wanted to keep the input tensors as (B, 3, N), but you can surely use fully connected layers instead.
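For what it's worth, a Conv1d with kernel size 1 on a (B, C, N) tensor computes exactly the same pointwise transform as a fully connected layer applied to each of the N points independently. A minimal PyTorch sketch (not from the repo) demonstrating the equivalence:

```python
import torch
import torch.nn as nn

B, C_in, C_out, N = 2, 3, 64, 1000
points = torch.randn(B, C_in, N)  # N query points, C_in channels each

conv = nn.Conv1d(C_in, C_out, kernel_size=1)
fc = nn.Linear(C_in, C_out)

# Copy the weights so both layers compute the identical function.
with torch.no_grad():
    fc.weight.copy_(conv.weight.squeeze(-1))  # (C_out, C_in, 1) -> (C_out, C_in)
    fc.bias.copy_(conv.bias)

out_conv = conv(points)                               # (B, C_out, N)
out_fc = fc(points.transpose(1, 2)).transpose(1, 2)   # same shape
print(torch.allclose(out_conv, out_fc, atol=1e-6))    # True
```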

The SDF values are binarized in your case, so even if a point is extremely close to the surface and the classifier predicts something around 0.5, the error will still be large (1 - 0.5). Why is it done that way? The DeepSDF paper, for example, regresses the actual distance with a clamping value.

This is also a design choice. We pose it as a classification problem of inside/outside occupancy instead of a regression problem on SDF values. Both are valid choices.
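Schematically, the two formulations differ only in the target and the loss; a toy sketch (variable names and the clamping value are illustrative, and this is not the exact loss code in this repo):

```python
import torch
import torch.nn.functional as F

# Toy data: raw MLP outputs and ground-truth signed distances for N
# sampled points (sign convention here: negative = inside the surface).
pred = torch.randn(1000)
sdf = torch.randn(1000)

# Occupancy classification: binarize the SDF into inside/outside
# labels and train with a classification loss.
occupancy = (sdf < 0).float()   # 1 = inside, 0 = outside
loss_cls = F.binary_cross_entropy_with_logits(pred, occupancy)

# DeepSDF-style regression: regress the clamped signed distance, so a
# point near the surface with a near-correct prediction has a small error.
delta = 0.1                     # clamping distance
loss_reg = F.l1_loss(pred.clamp(-delta, delta), sdf.clamp(-delta, delta))
```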

In the multi-view case all stacks are processed in parallel and the mean is calculated somewhere in the middle of the classifier. Each point has a stack of feature vectors and the depth of the point we want to query. But how can the layers after the mean know where the query point is? Wouldn't it be much more precise to just give the classifier normalized X, Y, Z coordinates in world space?

The problem is how to define "world space". Unlike ShapeNet, humans are articulated, and thus it is difficult to define a canonicalized space. Unless all the data samples share the same normalization, providing world coordinates is of little help. The multi-view PIFu takes xyz information in a view-dependent way, and average pooling consolidates the per-view information to make the final prediction in the shared coordinate space.
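In code terms, the fusion looks roughly like this (a schematic sketch; the function name and shapes are illustrative, not the repo's):

```python
import torch

def fuse_views(point_feats: torch.Tensor, n_views: int) -> torch.Tensor:
    """Average per-view point embeddings from the middle of the MLP.

    point_feats: (B * n_views, C, N) features computed from
    view-dependent (x, y, z) plus the sampled image features.
    Returns (B, C, N) view-consolidated features that the remaining
    MLP layers turn into one occupancy prediction per query point.
    """
    _, C, N = point_feats.shape
    point_feats = point_feats.view(-1, n_views, C, N)  # (B, n_views, C, N)
    return point_feats.mean(dim=1)                     # (B, C, N)

# Example: 3 views of one subject, 64 channels, 5000 query points.
fused = fuse_views(torch.randn(3, 64, 5000), n_views=3)  # -> (1, 64, 5000)
```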

From this image it looks like the input image is directly added to the feature vector after the encoder (the line below the upper image encoder). But where is this done in the code? I looked for it but didn't find it. The input always goes through some filtering before the classifier sees it, is that correct?

The feature from the geometry module is concatenated after the image encoder (yes, the figure is misleading). You can find where it is added here: https://github.com/shunsukesaito/PIFu/blob/975331106479436356fe8fae9ca2b96a56926930/lib/train_util.py#L79
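For intuition, the classifier's per-point input is built by channel-wise concatenation of sampled features; the raw image pixels only reach it through the encoder's filtering. An illustrative sketch (the names are mine, not the repo's):

```python
import torch

# Per-point features sampled from the image encoder's feature map at
# each point's 2D projection, plus the point's view-space depth.
im_feat = torch.randn(1, 256, 5000)   # (B, C, N) sampled encoder features
z_feat = torch.randn(1, 1, 5000)      # (B, 1, N) per-point depth

# The MLP classifier only ever sees this concatenated vector, never
# the raw input image itself.
point_feat = torch.cat([im_feat, z_feat], dim=1)   # (B, 257, N) -> MLP
```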

shunsukesaito avatar Dec 13 '20 18:12 shunsukesaito

Ahh, I see, thank you! I have one last question: you iterate over the number of hourglass stacks, and for each stack you query the classifier to train it. During testing only the last stack is used. But when you do this, you are also training the MLP with feature vectors it will never see during testing. So it has to find a way to make good predictions for each stack, right? Wouldn't it make more sense to disable backpropagation for the MLP except for the last stack, so you are only changing the weights in the hourglass stacks?

I have changed the entire setup quite dramatically, so I don't know if this also applies to your original code. But I found that using only the last stack (so no intermediate loss) during training dramatically improves the result; it goes from 60% IoU to around 80%.

GTO2013 avatar Dec 18 '20 16:12 GTO2013

Sorry for the late reply.

So it has to find a way to make good predictions for each stack, right?

Yes. The original motivation was to let each stack predict consistent feature vectors by sharing the same MLP. On the other hand, this may limit the expressiveness of the reconstruction from the last feature (and it looks like your experiment indicates that too). Please let me know if you publish/release your ongoing experiments somewhere. PIFu can still be improved in many aspects, and I'm curious how others progress this field!
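For concreteness, a toy sketch of the two supervision schemes being discussed (all names and shapes are illustrative, not the actual training code):

```python
import torch
import torch.nn as nn

# Stand-ins for the per-stack hourglass features already sampled at
# the query points, plus inside/outside labels for those points.
n_stacks, B, C, N = 4, 2, 32, 100
stack_feats = [torch.randn(B, C, N) for _ in range(n_stacks)]
labels = torch.randint(0, 2, (B, 1, N)).float()

mlp = nn.Conv1d(C, 1, kernel_size=1)   # shared pointwise classifier
criterion = nn.BCEWithLogitsLoss()

# Intermediate supervision: every stack's prediction goes through the
# shared MLP and contributes to the loss.
loss_all_stacks = sum(criterion(mlp(f), labels) for f in stack_feats)

# Last-stack-only variant: the MLP is trained only on the features it
# will actually see at test time; earlier stacks learn via the backbone.
loss_last_stack = criterion(mlp(stack_feats[-1]), labels)
```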

shunsukesaito avatar Jan 21 '21 15:01 shunsukesaito

Hi Robin,

I am currently working on understanding PIFu in detail, and I am curious how you improved the IoU by such a large margin. Could you please elaborate on your changes? Have you published them somewhere? Thanks a lot!

ywyue avatar Oct 13 '21 21:10 ywyue

You can find my changes here, but I made quite a lot of them: https://github.com/GTO2013/PIFu. It's made to be multi-view, and each image can be a different size, for example. I used it to reconstruct cars from blueprints: https://twitter.com/RobinK961/status/1387651500302815233

GTO2013 avatar Oct 14 '21 06:10 GTO2013

Thanks for your quick reply! I will check this.

ywyue avatar Oct 14 '21 09:10 ywyue