POP3D: Question about the feature space

Dear author, I am new to this field and have a detailed question about the methodology. Works such as SCLIP, which achieve zero-shot open-vocabulary segmentation with CLIP, generally apply PAMR as post-processing to the predictions in order to obtain visually clean segmentation results. In your work, you learn a voxel feature space from the features of the MaskCLIP+ image encoder and then take the inner product with the text embeddings directly, and the predictions are still good. Could you explain what in the optimization process leads to this outcome?
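To make the question concrete, this is roughly the inference step I have in mind, as a minimal sketch only. The function name `classify_voxels`, the array shapes, the use of cosine similarity (L2 normalization before the inner product), and the toy values are all my own assumptions for illustration, not code taken from the POP3D repository:

```python
import numpy as np

def classify_voxels(voxel_feats, text_embs):
    """Label each voxel with the class whose text embedding is most similar.

    voxel_feats: (N, D) array, one feature vector per voxel (hypothetical).
    text_embs:   (C, D) array, one text embedding per class name (hypothetical).
    """
    # L2-normalize both sides so the inner product becomes a cosine similarity.
    v = voxel_feats / np.linalg.norm(voxel_feats, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = v @ t.T                      # (N, C) similarity matrix
    return sims.argmax(axis=1), sims    # per-voxel label, raw similarities

# Toy data: 2 classes and 3 voxels in a 3-dimensional feature space.
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
voxel_feats = np.array([[0.9, 0.1, 0.0],
                        [0.2, 2.0, 0.1],
                        [3.0, 1.0, 0.0]])
labels, sims = classify_voxels(voxel_feats, text_embs)
print(labels)
```

Note that this sketch has no spatial refinement step such as PAMR: each voxel is labeled independently from its own feature vector, which is why I am curious about the quality of the results.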

Mar 02 '24 06:03 KarlLi5

Dear @KarlLi5, I am not sure that I completely understand your question. What exactly do you mean by "this outcome"?

Mar 04 '24 12:03 vobecant

I apologize, my explanation may have been unclear. What I meant is that, without PAMR mask refinement, the 2D segmentation results obtained using CLIP would be poor. If these features were projected into 3D space, the voxel predictions should, in theory, not be impressive either. However, judging from the visualizations in POP3D, the predictions for common classes in the dataset are notably good.

Mar 04 '24 12:03 KarlLi5

Hi, I think that the main reason would be the quality of the MaskCLIP+ features, don't you agree?

Mar 04 '24 14:03 vobecant

Due to the limitations of my device, I am unable to successfully replicate your work, so I have some questions about the details. Thank you for your reply!

Mar 05 '24 05:03 KarlLi5

@KarlLi5, so what are your questions?

Aug 13 '24 09:08 vobecant

Should there still be a problem, please feel free to re-open the issue.

Oct 26 '24 08:10 vobecant