
Souping on regression model leading to a drastic drop in accuracy

Open akhilperincherry opened this issue 2 years ago • 8 comments

Hello,

I have a regression model that I built by taking a MobileNet classifier (pre-trained with ImageNet weights), removing its classification head, and adding a flatten + dense layer that outputs a single scalar. I define an accuracy metric based on whether the absolute error is below a threshold.
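
Roughly, the model and metric look like this (a simplified sketch, not my exact code; the input size and the error threshold are placeholders):

import tensorflow as tf

THRESHOLD = 5.0  # placeholder: absolute-error threshold used for "accuracy"

def create_skeleton_model(input_shape=(224, 224, 3)):
    # MobileNet backbone with ImageNet weights, classification head removed
    backbone = tf.keras.applications.MobileNet(
        input_shape=input_shape, include_top=False, weights="imagenet")
    x = tf.keras.layers.Flatten()(backbone.output)
    output = tf.keras.layers.Dense(1)(x)  # scalar regression output
    return tf.keras.Model(inputs=backbone.input, outputs=output)

def threshold_accuracy(y_true, y_pred):
    # Fraction of predictions whose absolute error is below the threshold
    err = tf.abs(tf.reshape(y_true, [-1]) - tf.reshape(y_pred, [-1]))
    return tf.reduce_mean(tf.cast(err < THRESHOLD, tf.float32))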

I take the above model and train it first with linear probing (LP) for 15 iterations, then with fine-tuning (FT) for 2 iterations. This is my starter model, and it was trained using RMSprop.
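
Roughly, the LP/FT recipe for the starter looks like this (a simplified sketch; train_ds, the MSE loss, and treating the iteration counts as epochs are placeholders of mine):

# Linear probing (LP): freeze the MobileNet backbone, train only the new head
starter = create_skeleton_model()
for layer in starter.layers[:-2]:     # everything except the flatten + dense head
    layer.trainable = False
starter.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.005),
                loss="mse",           # assumption: some regression loss
                metrics=[threshold_accuracy])
starter.fit(train_ds, epochs=15)      # the "15 iterations" of LP

# Fine-tuning (FT): unfreeze the whole network and train briefly
for layer in starter.layers:
    layer.trainable = True
starter.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.005),
                loss="mse", metrics=[threshold_accuracy])
starter.fit(train_ds, epochs=2)       # the "2 iterations" of FT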

I then take this starter model and train it (using LP) with a variable number of iterations, variable learning rates, variable optimizer types (RMSprop, Adam, AdamW), and variable seeds to get my soup ingredient models, roughly as sketched below.
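
Schematically, the ingredient runs look like this (the configs, seeds, and epoch counts below are placeholders for my actual sweep):

# Each ingredient starts from the starter weights and repeats the LP step
# with a different optimizer, learning rate, and seed
configs = [
    (tf.keras.optimizers.Adam,    0.001, 0),
    (tf.keras.optimizers.RMSprop, 2e-05, 1),
    (tf.keras.optimizers.AdamW,   1e-05, 2),   # AdamW needs a recent TF version
]

ingredients = []
for optimizer_cls, lr, seed in configs:
    tf.keras.utils.set_random_seed(seed)
    m = create_skeleton_model()
    m.set_weights(starter.get_weights())       # start from the starter model
    for layer in m.layers[:-2]:                # LP: backbone stays frozen
        layer.trainable = False
    m.compile(optimizer=optimizer_cls(learning_rate=lr),
              loss="mse", metrics=[threshold_accuracy])
    m.fit(train_ds, epochs=10)                 # placeholder iteration count
    ingredients.append(m)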

I get approximately 91% accuracy on a held-out test set using the starter model, and 93% and 94% using two of my ingredient models.

Issue: I take a random pair of well-performing models (>90%) from among my starter and ingredient models and average their weights. However, the souped model almost always ends up with an accuracy of about 2% on the test set.

Illustrative code I use to average the weights:

import numpy as np
import tensorflow as tf

def uniform_soup(model_list):
    soups = []

    tf.keras.backend.clear_session()
    model_init = create_skeleton_model()  # any model from my starter or ingredients, just for its architecture

    for model_individual in model_list:
        # Snapshot every weight tensor of this ingredient as a numpy array
        soup = [np.array(weights) for weights in model_individual.weights]
        soups.append(soup)

    # Average each weight tensor across ingredients, tensor by tensor
    # (the tensors have different shapes, so they are averaged position-wise)
    mean_soup = [np.mean(tensors, axis=0) for tensors in zip(*soups)]

    ## Replacing the model's weights with the uniform-soup weights
    for w1, w2 in zip(model_init.weights, mean_soup):
        tf.keras.backend.set_value(w1, w2)

    return model_init
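
And a typical call, where m1, m2, and test_ds are placeholders for two loaded ingredient models and my held-out data:

souped = uniform_soup([m1, m2])
souped.compile(loss="mse", metrics=[threshold_accuracy])
print(souped.evaluate(test_ds))     # this is where I see the ~2% accuracy
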
  • Is there anything wrong in my design or anything that stands out to you?
  • Is it okay to use a regression model? Does anything in the loss landscape change owing to it being a regression model?

I did read through #10 and followed your advice on that thread when designing my souping.

Thanks in advance.

akhilperincherry avatar Aug 04 '23 01:08 akhilperincherry

We have not tried regression models... but I don't really see why that wouldn't work. I'm confused by this:

I get approximately 91% accuracy on a held-out test using the starter model, 93% and 94% using two of my ingredient models.

Does this mean that souping two models works but more does not?

mitchellnw avatar Aug 04 '23 17:08 mitchellnw

No, sorry, I meant that the models individually perform well: the starter model by itself gets 91% accuracy on the test set, and two of my ingredient models get 93% and 94% on their own. However, when I soup them, accuracy falls to around 2%. No souping works in my experiments, whether with 2 models or more (most of my experiments souped two ingredient models).

Another aspect is that MobileNet is a much smaller network than the ones used in the paper. The paper did say to expect only marginal performance increases with smaller ImageNet-based models, but a drop this drastic makes me think something is fundamentally wrong, maybe in my design?

akhilperincherry avatar Aug 04 '23 18:08 akhilperincherry

Hmmm. Are you introducing new params when fine-tuning? What LR?

mitchellnw avatar Aug 13 '23 22:08 mitchellnw

No new parameters.

The starter model was trained with an LR of 0.005 and RMSprop. The 7 ingredient models for souping were trained with LR/optimizer combinations of {0.001+Adam, 0.005+Adam, 0.001+AdamW, 1e-05+AdamW, 0.0005+RMSprop, 2e-05+RMSprop, 0.001+AdamW}.

akhilperincherry avatar Aug 14 '23 17:08 akhilperincherry

Can you try just souping the small-LR models, e.g., 1e-05+AdamW and 2e-05+RMSprop? I think the LR may just be too high for the other models.

mitchellnw avatar Aug 14 '23 20:08 mitchellnw

Thanks for the suggestion. I took the two models you mentioned (call them m1, m2) and I also took their starter model s0.

Souping s0, m2 -> 3.26%
Souping s0, m1 -> 9.54%
Souping m1, m2 -> 5.70%
Souping s0, m1, m2 -> 4.50%

These values look better than what I've seen before (~2%), but are still pretty bad overall. I also see that the range of predictions has shrunk: the individual models predict values ranging from 0 to 180, whereas the souped models' outputs span a much smaller range, for instance 30 to 90. I wonder if this is due to reduced representational capacity?
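
Roughly, the check I'm describing (X_test is a placeholder for my held-out inputs; uniform_soup is the function above):

# Compare the spread of predictions of the individual models vs. the soup
for name, m in [("m1", m1), ("m2", m2), ("soup", uniform_soup([m1, m2]))]:
    preds = m.predict(X_test).ravel()
    print(name, "min:", preds.min(), "max:", preds.max())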

akhilperincherry avatar Aug 21 '23 21:08 akhilperincherry

Hmm. I really don't know. I guess souping + regression may be an open problem. Sorry about that.

mitchellnw avatar Aug 21 '23 21:08 mitchellnw

No worries, thank you.

akhilperincherry avatar Aug 21 '23 23:08 akhilperincherry