Question regarding MOSAIKS Implementation
Discrepancy between MOSAIKS paper and MPC RCF implementation
Hi!
Given the recent open-sourcing of AlphaEarth by DeepMind and its comparison with MOSAIKS [1], I was taking a closer look at the implementation from @calebrob6's PR (https://github.com/microsoft/PlanetaryComputerExamples/pull/70).
Expected behavior (from paper)
Based on the MOSAIKS paper, the Random Convolution Features (RCF) process works as follows:
- Given an input image I, features are computed by randomly sampling K patches from across N images in the training set
- These K patches are then convolved over I to obtain K feature maps
- The feature maps are averaged over pixels to produce a K-dimensional feature vector X_i
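To make the comparison concrete, the paper's pipeline could be sketched roughly like this (my illustrative sketch only, not code from the paper or MPC; `sample_patches`, the ReLU nonlinearity, and all shapes are my assumptions):

```python
import torch
import torch.nn.functional as F

def sample_patches(images, k, kernel_size=3):
    """Randomly crop k patches from a stack of training images.

    images: (N, C, H, W) tensor; returns (k, C, kernel_size, kernel_size)
    patches to be used directly as convolution kernels.
    """
    n, _, h, w = images.shape
    patches = []
    for _ in range(k):
        i = torch.randint(0, n, (1,)).item()                    # random image
        y = torch.randint(0, h - kernel_size + 1, (1,)).item()  # random row
        x = torch.randint(0, w - kernel_size + 1, (1,)).item()  # random col
        patches.append(images[i, :, y:y + kernel_size, x:x + kernel_size])
    return torch.stack(patches)

def mosaiks_features(image, patches):
    """Convolve sampled patches over `image` and average each feature map."""
    fmaps = F.relu(F.conv2d(image.unsqueeze(0), patches))  # (1, K, H', W')
    return fmaps.mean(dim=(2, 3)).squeeze(0)               # (K,)

# Toy example: N = 8 training images, K = 16 features
train = torch.rand(8, 3, 32, 32)
kernels = sample_patches(train, k=16)
x_i = mosaiks_features(torch.rand(3, 32, 32), kernels)
print(x_i.shape)  # torch.Size([16])
```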
The paper includes a figure illustrating this process.
Actual implementation
However, the MPC implementation uses a different approach:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RCF(nn.Module):
    """A model for extracting Random Convolution Features (RCF) from input imagery."""

    def __init__(self, num_features=16, kernel_size=3, num_input_channels=3):
        super(RCF, self).__init__()
        # We create `num_features / 2` filters so require `num_features` to be divisible by 2
        assert num_features % 2 == 0
        self.conv1 = nn.Conv2d(
            num_input_channels,
            num_features // 2,
            kernel_size=kernel_size,
            stride=1,
            padding=0,
            dilation=1,
            bias=True,
        )
        nn.init.normal_(self.conv1.weight, mean=0.0, std=1.0)
        nn.init.constant_(self.conv1.bias, -1.0)

    def forward(self, x):
        x1a = F.relu(self.conv1(x), inplace=True)
        x1b = F.relu(-self.conv1(x), inplace=True)
        x1a = F.adaptive_avg_pool2d(x1a, (1, 1)).squeeze()
        x1b = F.adaptive_avg_pool2d(x1b, (1, 1)).squeeze()
        if len(x1a.shape) == 1:  # case where we passed a single input
            return torch.cat((x1a, x1b), dim=0)
        elif len(x1a.shape) == 2:  # case where we passed a batch of > 1 inputs
            return torch.cat((x1a, x1b), dim=1)
```
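If it helps to see the shapes, here is a functional re-implementation of that forward pass (my sketch, not MPC code): each of the `num_features // 2` Gaussian filters contributes two features, one from the positive and one from the negative ReLU half, after the constant -1 bias.

```python
import torch
import torch.nn.functional as F

num_features, kernel_size, channels = 16, 3, 3
# num_features // 2 random Gaussian filters with a constant -1 bias
weight = torch.randn(num_features // 2, channels, kernel_size, kernel_size)
bias = torch.full((num_features // 2,), -1.0)

x = torch.rand(4, channels, 64, 64)   # a batch of 4 images
conv = F.conv2d(x, weight, bias)      # (4, 8, 62, 62)
feats = torch.cat(
    [F.relu(conv).mean(dim=(2, 3)),   # positive ReLU half -> 8 features
     F.relu(-conv).mean(dim=(2, 3))], # negative ReLU half -> 8 features
    dim=1,
)
print(feats.shape)  # torch.Size([4, 16])
```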
As I understand it, instead of using K kernels extracted from N training images (as described in the paper), this implementation creates `num_features // 2` randomly initialized Gaussian kernels, each producing two features via the positive and negative ReLU halves. While randomly initialized kernels can be powerful feature extractors, this approach differs significantly from the paper's methodology.
Am I missing something that explains why this implementation choice was made? Is there a specific reason for deviating from the paper's approach of sampling patches from training images?
Thank you!
References
[1] Rolf, Esther, et al. "A generalizable and accessible approach to machine learning with global satellite imagery." Nature Communications 12.1 (2021): 4392.
I want to add an update: the Supplementary Material of the paper states that:
> Indeed, MOSAIKS is mathematically identical to the architecture one would arrive at if one designed a very shallow and very wide CNN without using backpropagation and instead using random filters. Specifically, MOSAIKS could be viewed as a two-layer CNN that has an 8,192-neuron wide hidden layer with untrained weights that are randomly initialized by drawing from sub-images in the sample, and that uses an average-pool over the entire image.
But this also deviates from the actual implementation as the convolutional layers are initialized with values drawn from the normal distribution not from the sub-images in the sample, right?
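For concreteness, what I would have expected from that description is a patch-based initialization of the conv layer in place of `nn.init.normal_`, something like this (a hypothetical sketch; `sample_images` stands in for training imagery and all sizes are my assumptions):

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 8, kernel_size=3, bias=True)

# Hypothetical: initialize each filter from a random 3x3 sub-image of the
# training sample instead of drawing weights from N(0, 1).
sample_images = torch.rand(100, 3, 32, 32)  # stand-in for N training images
with torch.no_grad():
    for f in range(conv1.out_channels):
        i = torch.randint(0, sample_images.shape[0], (1,)).item()
        y = torch.randint(0, 32 - 3 + 1, (1,)).item()
        x = torch.randint(0, 32 - 3 + 1, (1,)).item()
        conv1.weight[f] = sample_images[i, :, y:y + 3, x:x + 3]
    conv1.bias.fill_(-1.0)  # keep the constant bias from the MPC code
print(conv1.weight.shape)  # torch.Size([8, 3, 3, 3])
```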
Hey @alexlopezcifuentes -- you are correct that MOSAIKS as described in Rolf et al. uses kernels sampled from the dataset (and this implementation just chooses random samples). This implementation also doesn't whiten the kernels! If you are interested, we have a complete implementation in torchgeo -- https://github.com/microsoft/torchgeo/blob/main/torchgeo/models/rcf.py.
Pinging @estherrolf for motivation on the empirical vs. random features question
Also, in https://arxiv.org/pdf/2305.13456 we find that sampling patches from the dataset (i.e. RCF empirical or "MOSAIKS") is often better than random patches (i.e. RCF gaussian)
Hi @calebrob6, thanks a lot for the quick reply. I will take a look at the torchgeo implementation to see how it is done there.
Thanks also for sharing the paper; it is pretty cool that you investigated the differences between sampling patches from the dataset and using random ones.
It would also be really nice to hear @estherrolf's comments and intuition on the differences and motivation, for sure!