CLIP-fine-tune Question About Geometric Parametrization

Hello,

Thanks for your amazing work,

I am currently fine-tuning a CoCa model using pre-trained weights. I did not understand whether the pre-trained weights can be transformed into their geometric parametrization, or whether the CLIP model has to be trained from scratch in geometric parametrization form. Are the two parametrizations "equivalent" in this sense?

Thanks,

Mathieu

Jul 16 '24 11:07 mat10599

Hi Mathieu!

You can indeed take any pre-trained weights, convert them to geometric parametrization (GmP), fine-tune, and then convert back from the GmP .theta and .r parameters to .weight. My fine-tuning code handles the conversion from "normal" pre-trained .weight to GmP; the script I provided, exp-ft-C-convert-GmP-back-to-weight.py, converts the fine-tuned model back to .weight.
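As a rough illustration of that round trip (this is not the code from exp-ft-C-convert-GmP-back-to-weight.py), the sketch below splits a weight matrix into a per-row norm r and a unit direction theta, then rebuilds it as r * normalize(theta). The helper names weight_to_gmp and gmp_to_weight are made up for this example, and the random matrix only stands in for a pre-trained weight.

```python
import torch
import torch.nn.functional as F


def weight_to_gmp(weight: torch.Tensor):
    """Split each row of `weight` into a magnitude r and a unit direction theta."""
    r = torch.norm(weight, dim=1, keepdim=True)   # shape (out_features, 1)
    theta = F.normalize(weight, p=2, dim=1)       # shape (out_features, in_features)
    return r, theta


def gmp_to_weight(r: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """Rebuild an ordinary weight matrix from r and theta."""
    # theta is re-normalized because fine-tuning may move it off the unit sphere.
    return r * F.normalize(theta, p=2, dim=1)


torch.manual_seed(0)
weight = torch.randn(8, 4)          # stand-in for a pre-trained nn.Linear weight
r, theta = weight_to_gmp(weight)
restored = gmp_to_weight(r, theta)
max_error = (weight - restored).abs().max().item()
print(f"max round-trip error: {max_error:.2e}")
```

Before any fine-tuning step, the rebuilt matrix matches the original up to floating-point rounding, which is the sense in which the two forms are "equivalent".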

Converting the model back to .weight after fine-tuning makes it "just like the original pre-trained weights", so you can use it for downstream tasks as-is. In fact, I strongly recommend converting it back to .weight:

I have conducted some experiments with GmP using gradient ascent (visualizing the features in CLIP) and found that enforcing determinism via PyTorch does NOT lead to deterministic behavior on GPU when using GmP. It essentially behaves like a "random seed", with slightly different outcomes due to numerical instability (even in full precision). However, once the model is converted to the "normal" .weight, the GmP fine-tuned model behaves flawlessly (and deterministically, when PyTorch etc. is set to be deterministic).
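For readers wondering what "setting PyTorch to be deterministic" typically involves, here is a minimal sketch. The exact settings used in the experiments above are not stated in this thread, so the combination below is an assumption based on PyTorch's standard reproducibility switches, and make_deterministic is a name made up for this example.

```python
import os
import random

import torch

# Some CUDA (cuBLAS) ops require this when deterministic algorithms are enforced;
# it has to be set before those ops first run.
os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")


def make_deterministic(seed: int = 0) -> None:
    """Seed the RNGs and ask PyTorch to use deterministic algorithms only."""
    random.seed(seed)
    torch.manual_seed(seed)                    # seeds the CPU and CUDA generators
    torch.backends.cudnn.benchmark = False     # disable cuDNN autotuning
    torch.use_deterministic_algorithms(True)   # error out on known non-deterministic ops


make_deterministic(0)
a = torch.rand(3)
make_deterministic(0)
b = torch.rand(3)
print(torch.equal(a, b))
```

According to the observation above, switches like these were not enough on GPU while the model was still in GmP form, but were enough after converting back to .weight.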

I have fine-tuned my GmP CLIP (from pre-trained OpenAI/CLIP) - which outperforms original pre-trained CLIP on ImageNet/ObjectNet + VOC2007_multilabel - on 1x RTX 4090 (!) with a batch_size of 40 (!!!), dataset: 40k text-image pairs. =)

So, yes, you can indeed transform pre-trained weights, and GmP allows for efficient fine-tuning even on a "very GPU-poor compute resource" that would otherwise (without GmP) lead to overfitting or even embeddings collapsing due to small batch_size (CLIP models are typically trained on batch_size 2048 and up). No need to train anything from scratch and spend $100k!

GmP can likely be applied to a broad range of different models; in fact, my implementation for CLIP is actually based on the truly amazing work of the authors of the paper ReLU Characteristic Activation Analysis - which discusses the dramatically improved convergence + superior stability of GmP in the context of other, ReLU-based models. I merely decided to "give it a shot, trial-and-error" and applied their research to GELU-based CLIP (which the authors did not mention in their paper).

I hope that helps - let me know if you have any other questions. Kind regards!

Jul 16 '24 12:07 zer0int

PS: Here's the low-down of the changes I made to CLIP, as written by GPT-4o after 'doing a diff' (I read it and can confirm there are no 'hallucinations' in the AI's response, though!).

1. Diff between `model.py` and `modelgmp.py`

Added GeometricLinear Class in modelgmp.py:

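# Imports this snippet relies on
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):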
    def __init__(self, in_features, out_features, bias=True):
        super(GeometricLinear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features

        # Radial component
        self.r = nn.Parameter(torch.Tensor(out_features, 1))
        # Angular component
        self.theta = nn.Parameter(torch.Tensor(out_features, in_features))

        if bias:
            self.bias = nn.Parameter(torch.Tensor(out_features))
        else:
            self.register_parameter('bias', None)

        self.reset_parameters()

    def reset_parameters(self):
        nn.init.kaiming_uniform_(self.theta, a=np.sqrt(5))
        fan_in = self.in_features
        bound = 1 / np.sqrt(fan_in)
        nn.init.uniform_(self.r, -bound, bound)
        if self.bias is not None:
            nn.init.uniform_(self.bias, -bound, bound)

    def forward(self, input):
        u = F.normalize(self.theta, p=2, dim=1)  # Normalize theta to get unit vector u
        output = F.linear(input, self.r * u)     # Geometric parameterization
        if self.bias is not None:
            output += self.bias
        return output

Modified ResidualAttentionBlock to use GeometricLinear instead of nn.Linear:

In modelgmp.py, the ResidualAttentionBlock class now uses GeometricLinear instead of nn.Linear:

class ResidualAttentionBlock(nn.Module):
    def __init__(self, d_model: int, n_head: int, attn_mask: torch.Tensor = None):
        super().__init__()

        self.attn = nn.MultiheadAttention(d_model, n_head)
        self.ln_1 = LayerNorm(d_model)
        self.mlp = nn.Sequential(OrderedDict([
            ("c_fc", GeometricLinear(d_model, d_model * 4)),
            ("gelu", QuickGELU()),
            ("c_proj", GeometricLinear(d_model * 4, d_model))
        ]))
        self.ln_2 = LayerNorm(d_model)
        self.attn_mask = attn_mask

Initialization in CLIP Class:

In modelgmp.py, there are adjustments for initializing GeometricLinear layers:

# Handle GeometricLinear layers
if isinstance(block.mlp.c_fc, GeometricLinear):
    nn.init.normal_(block.mlp.c_fc.r, std=fc_std)
    nn.init.kaiming_uniform_(block.mlp.c_fc.theta, a=np.sqrt(5))
else:
    nn.init.normal_(block.mlp.c_fc.weight, std=fc_std)

if isinstance(block.mlp.c_proj, GeometricLinear):
    nn.init.normal_(block.mlp.c_proj.r, std=proj_std)
    nn.init.kaiming_uniform_(block.mlp.c_proj.theta, a=np.sqrt(5))
else:
    nn.init.normal_(block.mlp.c_proj.weight, std=proj_std)

Adjustment for State Dictionary:

The build_model function in modelgmp.py includes a step to adjust the state dictionary for GeometricLinear layers:

def adjust_state_dict(state_dict):
    new_state_dict = {}
    
    for key, value in state_dict.items():
        # Handle the conversion for GeometricLinear layers
        if "mlp.c_fc.weight" in key:
            base_key = key.replace("weight", "")
            new_state_dict[base_key + "r"] = torch.norm(value, dim=1, keepdim=True)
            new_state_dict[base_key + "theta"] = F.normalize(value, p=2, dim=1)
        elif "mlp.c_proj.weight" in key:
            base_key = key.replace("weight", "")
            new_state_dict[base_key + "r"] = torch.norm(value, dim=1, keepdim=True)
            new_state_dict[base_key + "theta"] = F.normalize(value, p=2, dim=1)
        else:
            new_state_dict[key] = value

    return new_state_dict
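
To make the key mapping concrete, here is a small sketch that applies the same idea to a toy state dict. The function to_gmp_keys is a compact re-statement written for this example (not the repo's code), and the key names are hypothetical ones in the style of a CLIP transformer block.

```python
import torch
import torch.nn.functional as F


def to_gmp_keys(state_dict):
    """Compact re-statement of the conversion above, for illustration only."""
    converted = {}
    for key, value in state_dict.items():
        if key.endswith(("mlp.c_fc.weight", "mlp.c_proj.weight")):
            base = key[: -len("weight")]
            converted[base + "r"] = torch.norm(value, dim=1, keepdim=True)
            converted[base + "theta"] = F.normalize(value, p=2, dim=1)
        else:
            # Biases, attention weights, etc. pass through unchanged.
            converted[key] = value
    return converted


# Hypothetical key names, chosen only for this example.
prefix = "transformer.resblocks.0."
toy = {
    prefix + "mlp.c_fc.weight": torch.randn(8, 2),
    prefix + "mlp.c_fc.bias": torch.zeros(8),
    prefix + "attn.out_proj.weight": torch.randn(2, 2),
}
converted = to_gmp_keys(toy)
for name, tensor in converted.items():
    print(name, tuple(tensor.shape))
```

Only the MLP weights are split into .r and .theta; every other entry keeps its original key and value.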

2. Explanation of the Differences and Implications

Geometric Parameterization (GeometricLinear):

The GeometricLinear class parameterizes a linear layer differently from nn.Linear. Instead of learning the weight matrix directly, it learns a radial component r and an angular component theta, and uses r * normalize(theta) as the effective weight.
r holds one magnitude per output row, while theta holds that row's direction. Magnitude and direction are therefore updated as separate parameters during training.

Implications:

Better Generalization:
- The separation of magnitude and direction can lead to better generalization as the model might learn more robust features that are less sensitive to the scale of the input.
Stability:
- Normalizing theta ensures that the direction of the weights remains unit length, which can help in stabilizing the training process and prevent exploding gradients.
Interpretability:
- This decomposition can also make the model more interpretable, as the influence of individual features can be analyzed through the magnitudes and directions separately.
Compatibility:
- By adjusting the state dictionary to handle the GeometricLinear layers, existing models can be fine-tuned or loaded into the new architecture without issues.
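
The separation of magnitude and direction mentioned in the list above can be checked numerically. In this sketch (gmp_linear is a stand-in written for this example that mirrors the forward pass quoted earlier, without bias), rescaling theta leaves the output unchanged, while rescaling r scales the output by the same factor.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 6)
r = torch.rand(3, 1) + 0.5      # per-row magnitudes
theta = torch.randn(3, 6)       # per-row directions (not yet unit length)


def gmp_linear(x, r, theta):
    """Bias-free linear layer with effective weight r * normalize(theta)."""
    return F.linear(x, r * F.normalize(theta, p=2, dim=1))


y = gmp_linear(x, r, theta)
y_scaled_theta = gmp_linear(x, r, 10.0 * theta)   # same direction, longer theta
y_scaled_r = gmp_linear(x, 2.0 * r, theta)        # same direction, doubled magnitude

print(torch.allclose(y, y_scaled_theta, atol=1e-4))
print(torch.allclose(2.0 * y, y_scaled_r, atol=1e-4))
```

This only demonstrates the algebraic property; whether it translates into better generalization or stability for a given model is an empirical question.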

The changes in the modelgmp.py version primarily focus on integrating this geometric parameterization into the model, which can lead to improved performance and stability in various tasks.

Jul 16 '24 12:07 zer0int

Hello,

Thanks, that is very clear.

Just one question before I try it: if I use model.load_state_dict in the modified CLIP version, will the weights be automatically translated into their geometric parametrization, or do I have to call the adjust_state_dict function myself?

Either way, I am definitely curious about this and I will have a go at it!

Jul 18 '24 15:07 mat10599

You can just use import gmpclip as clip from my repo, instead of import clip. That will work for any code (but you may need to adjust the code itself to work with the model, depending on what you are trying to do).

If you then use model, preprocess = clip.load("ViT-L/14") and do print(model), you will see it has GeometricLinear() in the MLP:

(c_fc): GeometricLinear()
(gelu): QuickGELU()
(c_proj): GeometricLinear()

This model is "converted" to GmP.

However, it's not of much use without fine-tuning / adjusting the weights for the modification.

If you mean loading the model after fine-tuning, you can load your originally saved fine-tuned file directly, without converting the weights back to .weight:

import torch
import gmpclip as clip
_, preprocess = clip.load("ViT-L/14")
model = torch.load("your_finetuned_model_file.pt")

I hope that answers your question (I am not entirely sure that's what you asked)!

Jul 18 '24 18:07 zer0int
