Regarding the question on feature space

Open KarlLi5 opened this issue 1 year ago • 5 comments

Dear author, I am new to this field and have a detailed question about the methodology. Works such as SCLIP, which achieve zero-shot open-vocabulary segmentation with CLIP, generally apply PAMR as post-processing to the predictions in order to obtain visually clean segmentation results. In your work, you distill the image-encoder features of MaskCLIP+ into a voxel space and then take the inner product with the text embeddings directly, and you still obtain strong predictions. Could you explain what in the optimization process leads to this outcome?
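To make the step I am asking about concrete, here is a minimal sketch of how I understand "inner product with the text embedding": each voxel feature is compared with each class text embedding and takes the best-scoring class. This is not code from POP3D. The function names, the toy 3-dimensional vectors, and the class names are all assumptions for illustration; real features would come from the trained 3D network and the CLIP text encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def classify_voxels(voxel_feats, text_embeds):
    """For each voxel feature, return the index of the most similar text embedding.

    Similarity is the inner product of L2-normalized vectors (cosine similarity).
    """
    text_n = [l2_normalize(t) for t in text_embeds]
    labels = []
    for feat in voxel_feats:
        f = l2_normalize(feat)
        scores = [sum(a * b for a, b in zip(f, t)) for t in text_n]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


# Hypothetical toy data: three class prompts and three voxel features.
class_names = ["car", "road", "vegetation"]
text_embeds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
voxel_feats = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.0, 0.3, 0.8]]

labels = classify_voxels(voxel_feats, text_embeds)
predicted = [class_names[i] for i in labels]
print(predicted)
```

No spatial refinement such as PAMR appears anywhere in this sketch, which is exactly why I am surprised that the per-voxel predictions look so clean.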

KarlLi5 avatar Mar 02 '24 06:03 KarlLi5

Dear @KarlLi5, I am not sure that I completely understand your question. What exactly do you mean by "this outcome"?

vobecant avatar Mar 04 '24 12:03 vobecant

I apologize, my explanation may have caused a misunderstanding. What I meant is that, without PAMR refining the masks, the 2D segmentation results obtained with CLIP would be poor. If those features were projected into 3D space, the voxel predictions should in theory not be impressive either. However, judging by the visualizations from POP3D, the predictions for the common classes in the dataset are notably good.

KarlLi5 avatar Mar 04 '24 12:03 KarlLi5

Hi, I think that the main reason would be the quality of the MaskCLIP+ features, don't you agree?

vobecant avatar Mar 04 '24 14:03 vobecant

Due to the limitations of my device, I am unable to successfully replicate your work, so I have some questions about the details. Thank you for your reply!

KarlLi5 avatar Mar 05 '24 05:03 KarlLi5

@KarlLi5, so what are your questions?

vobecant avatar Aug 13 '24 09:08 vobecant

Should there still be a problem, please feel free to re-open the issue.

vobecant avatar Oct 26 '24 08:10 vobecant