Question regarding MOSAIKS Implementation
Discrepancy between MOSAIKS paper and MPC RCF implementation
Hi!
Given the recent open-sourcing of AlphaEarth by DeepMind and its comparison with MOSAIKS [1], I was taking a closer look at the implementation from @calebrob6's PR (https://github.com/microsoft/PlanetaryComputerExamples/pull/70).
Expected behavior (from paper)
Based on the MOSAIKS paper, the Random Convolution Features (RCF) process works as follows:
- Given an input image I, features are computed by randomly sampling K patches from across N images in the training set
- These K patches are then convolved over I to obtain K feature maps
- The feature maps are averaged over pixels to produce a K-dimensional feature vector X_i
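To make the comparison concrete, the paper's pipeline could be sketched roughly like this (my illustrative sketch only, not code from the paper or MPC; `sample_patches`, the ReLU nonlinearity, and all shapes are my assumptions):

```python
import torch
import torch.nn.functional as F

def sample_patches(images, k, kernel_size=3):
    """Randomly crop k patches from a stack of training images.

    images: (N, C, H, W) tensor; returns (k, C, kernel_size, kernel_size)
    patches to be used directly as convolution kernels.
    """
    n, _, h, w = images.shape
    patches = []
    for _ in range(k):
        i = torch.randint(0, n, (1,)).item()                    # random image
        y = torch.randint(0, h - kernel_size + 1, (1,)).item()  # random row
        x = torch.randint(0, w - kernel_size + 1, (1,)).item()  # random col
        patches.append(images[i, :, y:y + kernel_size, x:x + kernel_size])
    return torch.stack(patches)

def mosaiks_features(image, patches):
    """Convolve sampled patches over `image` and average each feature map."""
    fmaps = F.relu(F.conv2d(image.unsqueeze(0), patches))  # (1, K, H', W')
    return fmaps.mean(dim=(2, 3)).squeeze(0)               # (K,)

# Toy example: N = 8 training images, K = 16 features
train = torch.rand(8, 3, 32, 32)
kernels = sample_patches(train, k=16)
x_i = mosaiks_features(torch.rand(3, 32, 32), kernels)
print(x_i.shape)  # torch.Size([16])
```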
The paper includes a figure illustrating this process.
Actual implementation
However, the MPC implementation uses a different approach:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RCF(nn.Module):
    """A model for extracting Random Convolution Features (RCF) from input imagery."""

    def __init__(self, num_features=16, kernel_size=3, num_input_channels=3):
        super(RCF, self).__init__()
        # We create `num_features / 2` filters so require `num_features` to be divisible by 2
        assert num_features % 2 == 0
        self.conv1 = nn.Conv2d(
            num_input_channels,
            num_features // 2,
            kernel_size=kernel_size,
            stride=1,
            padding=0,
            dilation=1,
            bias=True,
        )
        nn.init.normal_(self.conv1.weight, mean=0.0, std=1.0)
        nn.init.constant_(self.conv1.bias, -1.0)

    def forward(self, x):
        x1a = F.relu(self.conv1(x), inplace=True)
        x1b = F.relu(-self.conv1(x), inplace=True)
        x1a = F.adaptive_avg_pool2d(x1a, (1, 1)).squeeze()
        x1b = F.adaptive_avg_pool2d(x1b, (1, 1)).squeeze()
        if len(x1a.shape) == 1:  # case where we passed a single input
            return torch.cat((x1a, x1b), dim=0)
        elif len(x1a.shape) == 2:  # case where we passed a batch of > 1 inputs
            return torch.cat((x1a, x1b), dim=1)
```
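If it helps to see the shapes, here is a functional re-implementation of that forward pass (my sketch, not MPC code): each of the `num_features // 2` Gaussian filters contributes two features, one from the positive and one from the negative ReLU half, after the constant -1 bias.

```python
import torch
import torch.nn.functional as F

num_features, kernel_size, channels = 16, 3, 3
# num_features // 2 random Gaussian filters with a constant -1 bias
weight = torch.randn(num_features // 2, channels, kernel_size, kernel_size)
bias = torch.full((num_features // 2,), -1.0)

x = torch.rand(4, channels, 64, 64)   # a batch of 4 images
conv = F.conv2d(x, weight, bias)      # (4, 8, 62, 62)
feats = torch.cat(
    [F.relu(conv).mean(dim=(2, 3)),   # positive ReLU half -> 8 features
     F.relu(-conv).mean(dim=(2, 3))], # negative ReLU half -> 8 features
    dim=1,
)
print(feats.shape)  # torch.Size([4, 16])
```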
As I understand it, instead of using K kernels extracted from N training images (as described in the paper), this implementation creates `num_features // 2` randomly initialized Gaussian kernels, each producing two features via the positive and negative ReLU halves. While randomly initialized kernels can be powerful feature extractors, this approach differs significantly from the paper's methodology.
Am I missing something that explains why this implementation choice was made? Is there a specific reason for deviating from the paper's approach of sampling patches from training images?
Thank you!
References
[1] Rolf, Esther, et al. "A generalizable and accessible approach to machine learning with global satellite imagery." Nature Communications 12.1 (2021): 4392.
I want to add an update: the Supplementary Material of the paper states that:
> Indeed, MOSAIKS is mathematically identical to the architecture one would arrive at if one designed a very shallow and very wide CNN without using backpropagation and instead using random filters. Specifically, MOSAIKS could be viewed as a two-layer CNN that has an 8,192-neuron wide hidden layer with untrained weights that are randomly initialized by drawing from sub-images in the sample, and that uses an average-pool over the entire image.
But this also deviates from the actual implementation as the convolutional layers are initialized with values drawn from the normal distribution not from the sub-images in the sample, right?
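For concreteness, what I would have expected from that description is a patch-based initialization of the conv layer in place of `nn.init.normal_`, something like this (a hypothetical sketch; `sample_images` stands in for training imagery and all sizes are my assumptions):

```python
import torch
import torch.nn as nn

conv1 = nn.Conv2d(3, 8, kernel_size=3, bias=True)

# Hypothetical: initialize each filter from a random 3x3 sub-image of the
# training sample instead of drawing weights from N(0, 1).
sample_images = torch.rand(100, 3, 32, 32)  # stand-in for N training images
with torch.no_grad():
    for f in range(conv1.out_channels):
        i = torch.randint(0, sample_images.shape[0], (1,)).item()
        y = torch.randint(0, 32 - 3 + 1, (1,)).item()
        x = torch.randint(0, 32 - 3 + 1, (1,)).item()
        conv1.weight[f] = sample_images[i, :, y:y + 3, x:x + 3]
    conv1.bias.fill_(-1.0)  # keep the constant bias from the MPC code
print(conv1.weight.shape)  # torch.Size([8, 3, 3, 3])
```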
Hey @alexlopezcifuentes -- you are correct that MOSAIKS as described in Rolf et al. uses kernels sampled from the dataset (and this implementation just chooses random samples). This implementation also doesn't whiten the kernels! If you are interested, we have a complete implementation in torchgeo -- https://github.com/microsoft/torchgeo/blob/main/torchgeo/models/rcf.py.
Pinging @estherrolf for motivation on the empirical vs. random features question
Also, in https://arxiv.org/pdf/2305.13456 we find that sampling patches from the dataset (i.e. RCF empirical or "MOSAIKS") is often better than random patches (i.e. RCF gaussian)
Hi @calebrob6, thanks a lot for the quick reply. I will take a look at the torchgeo implementation to see how it is done there.
Thanks also for sharing the paper; it is pretty cool that you investigated the differences between sampling patches from the dataset and using random ones.
It would also be really nice to hear @estherrolf's comments and intuition on the differences and motivation, for sure!