
Feature transformation before PCA

WangYixuan12 opened this issue 2 years ago · 2 comments

Thank you for sharing Figure 1 from the paper, which showcases the mapping of features to RGB channels using PCA. I found it to be really impressive! I was wondering if I could ask a question about the details of the PCA process. Specifically, I was curious to know if the features were normalized, scaled, or translated before applying PCA. If they were, could you kindly provide me with more information on how the normalization, scaling, or translation was carried out? For instance, I am curious about the axis along which normalization or scaling was performed, and whether the normalization or scaling factors were computed based on individual images or the entire training dataset. Thank you very much in advance for your help!

WangYixuan12 avatar Apr 21 '23 19:04 WangYixuan12

        x_norm = self.norm(x)  # final LayerNorm over the token dimension
        return {
            "x_norm_clstoken": x_norm[:, 0],      # normalized CLS token
            "x_norm_patchtokens": x_norm[:, 1:],  # normalized patch tokens
            "x_prenorm": x,                       # tokens before the final LayerNorm
            "masks": masks,
        }

There is a layer norm applied right before the output is returned, and there is also a LayerScale in the last FFN block.

As for the PCA analysis, the first PCA (background/foreground separation) seems to be applied per image, while the second PCA (RGB features) is applied over a small batch of similar images.

That's the only information we have so far.

ccharest93 avatar Apr 22 '23 17:04 ccharest93

Hello @WangYixuan12, thanks for liking our work!

The pseudocode for computing the PCA features is as follows:

  1. Get the foreground segmenter: Compute the PCA on 1 image (can be more) with a clear background (like an animal against the sky, or a drawing on a white background). The image should be clean, with one object and no noisy background. If you visualize the 1st component in 2D, it should separate the background from the foreground; this 1st component can then be used for zero-shot background/foreground segmentation.
  2. Keep only the foreground tokens: Apply this PCA to all images you want to visualize and keep only the tokens with a positive/negative 1st component (the sign is arbitrary, so just check on one image). This gives the "background/foreground" separation.
  3. Get a PCA space that we will color as RGB: Compute a new PCA on all foreground tokens from all images.
  4. Normalize the PCA: Min-max normalize the first 3 PCA features independently (so one normalization per channel), with the min and max computed over all images. You can also compute the min/max per image.
  5. Plot your RGB image: Project the tokens back to their original 2D shape and multiply by 255 to get your RGB visualization.
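The five steps above can be sketched with plain NumPy. This is a minimal reconstruction, not the authors' actual script: the function names (`pca`, `pca_rgb`), the input layout `(n_images, h, w, d)` of patch tokens, and the threshold-at-zero foreground rule are all illustrative assumptions.

```python
import numpy as np

def pca(tokens, n_components):
    """Plain PCA via SVD on mean-centered data; returns (mean, components)."""
    mean = tokens.mean(axis=0)
    # Rows of vt are the principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(tokens - mean, full_matrices=False)
    return mean, vt[:n_components]

def pca_rgb(features, fg_image_idx=0):
    """features: (n_images, h, w, d) patch tokens.
    Returns (n_images, h, w, 3) uint8 RGB maps and (n_images, h, w) fg masks."""
    n, h, w, d = features.shape
    flat = features.reshape(n, h * w, d)

    # Step 1: fit a PCA on one clean image; its 1st component should
    # separate foreground from background.
    mean1, comps1 = pca(flat[fg_image_idx], 1)

    # Step 2: project all tokens and threshold the 1st component at 0
    # (the sign is arbitrary -- flip the mask if the background comes out positive).
    first = (flat - mean1) @ comps1[0]
    fg = first > 0

    # Step 3: fit a second PCA on the foreground tokens of all images.
    mean2, comps2 = pca(flat[fg], 3)
    proj = (flat - mean2) @ comps2.T  # (n, h*w, 3)

    # Step 4: min-max normalize each of the 3 components over all fg tokens.
    lo = proj[fg].min(axis=0)
    hi = proj[fg].max(axis=0)
    proj = (proj - lo) / (hi - lo + 1e-8)

    # Step 5: zero out the background, reshape to 2D, scale to 0..255.
    proj[~fg] = 0.0
    rgb = (np.clip(proj, 0.0, 1.0) * 255).astype(np.uint8)
    return rgb.reshape(n, h, w, 3), fg.reshape(n, h, w)
```

With real DINOv2 features, `features` would be the `"x_norm_patchtokens"` output reshaped to the patch grid; here any `(n, h, w, d)` array works for experimentation.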

What you have to keep in mind is that by doing so we want the PCA to focus on shared factors of variation among "foreground" objects. So we need to remove the other factors of variation (the background) and compute a shared PCA/RGB space across all images. The PCA visualization only shows a small part of what the features represent semantically, so feel free to try other kinds of mappings and to look at other PCA dimensions (they usually look even more incredible than the first 3!).

TheoMoutakanni avatar Apr 24 '23 10:04 TheoMoutakanni

Closing as answered.

patricklabatut avatar Aug 23 '23 21:08 patricklabatut