
Inconsistent module names (state_dict keys).

Open wenh06 opened this issue 1 year ago • 8 comments

System Info

transformers version: 4.39.1
Platform: Ubuntu 22.04
Python version: 3.10.12

Who can help?

@amyeroberts

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

I used the facebook/convnextv2 family to train models for my own classification task. At first, the model was unable to learn anything. I then found that it printed the following warning message:

Some weights of ConvNextV2Backbone were not initialized from the model checkpoint at facebook/convnextv2-nano-22k-384 and are newly initialized: ['convnextv2.hidden_states_norms.stage4.bias', 'convnextv2.hidden_states_norms.stage4.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

So I checked the module names (via state_dict.keys()); the last few are:

encoder.stages.3.layers.1.pwconv2.weight
encoder.stages.3.layers.1.pwconv2.bias
hidden_states_norms.stage4.weight
hidden_states_norms.stage4.bias

The state_dict keys of the checkpoint loaded from the local cache are (again, the last few):

convnextv2.encoder.stages.3.layers.1.pwconv2.weight
convnextv2.encoder.stages.3.layers.1.pwconv2.bias
convnextv2.layernorm.weight
convnextv2.layernorm.bias
classifier.weight
classifier.bias

I realized that the pretrained model hosted on the hub was exported with the last module named layernorm, which was renamed to hidden_states_norms at some point (I don't know in which version). In the current version, that module is therefore assigned random weights, ruining the whole backbone.
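For reference, a minimal sketch of how the two key sets can be compared (this assumes the checkpoint ships a pytorch_model.bin file, as the cached files did in my case):

import torch
from huggingface_hub import hf_hub_download
from transformers import ConvNextV2Backbone

model_id = "facebook/convnextv2-nano-22k-384"

# keys of the freshly instantiated backbone; the final norms live
# under `hidden_states_norms.stage4`
backbone = ConvNextV2Backbone.from_pretrained(model_id)
print(list(backbone.state_dict())[-4:])

# keys stored in the checkpoint; the final norm is still called `layernorm`
weight_file = hf_hub_download(model_id, "pytorch_model.bin")
checkpoint = torch.load(weight_file, map_location="cpu")
print(list(checkpoint)[-4:])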

Expected behavior

My workaround for this issue is to add the following method:

    # requires: `import re`, `import torch`, `from pathlib import Path`,
    # and a `MODEL_CACHE_DIR` constant pointing at the hub cache directory
    def __post_init(self) -> None:
        if self.source == "hf":
            if re.search("facebook(\\/|\\-\\-)convnextv2", self.backbone_name_or_path):
                # locate the checkpoint file, either under a local path
                # or inside the hub cache directory
                if Path(self.backbone_name_or_path).exists():
                    weight_file = list(Path(self.backbone_name_or_path).rglob("pytorch_model.bin"))[0]
                else:
                    weight_file = list(
                        (Path(MODEL_CACHE_DIR) / Path(f"""models--{self.backbone_name_or_path.replace("/", "--")}"""))
                        .expanduser().resolve().rglob("pytorch_model.bin")
                    )[0]
                state_dict = torch.load(weight_file, map_location="cpu")
                # copy the pretrained final `layernorm` weights into the
                # newly initialized `hidden_states_norms.stage4` module
                new_state_dict = {
                    "stage4.weight": state_dict["convnextv2.layernorm.weight"].detach().clone(),
                    "stage4.bias": state_dict["convnextv2.layernorm.bias"].detach().clone(),
                }
                self.backbone.hidden_states_norms.load_state_dict(new_state_dict)
                print(
                    "Loaded pretrained layer norm weights into the last hidden_states_norms layer from "
                    f"weights file: {weight_file}"
                )
                # drop the reference so the full state dict can be freed
                del state_dict

which is called at the end of the __init__ method of my customized model. After adding this workaround, the model finally started to learn from my data.
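For context, this is roughly how the hook is wired in; the wrapper class below is a simplified, hypothetical sketch of my model:

import torch
import transformers

class MyMultiHeadModel(torch.nn.Module):  # hypothetical, for illustration only
    def __init__(self, backbone_name_or_path: str) -> None:
        super().__init__()
        self.source = "hf"
        self.backbone_name_or_path = backbone_name_or_path
        self.backbone = transformers.AutoBackbone.from_pretrained(backbone_name_or_path)
        # ... create the task-specific heads here ...
        self.__post_init()  # patch the last `hidden_states_norms` weights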

I wonder if we could add a keyword argument, named for example key_mapping, to the classmethod transformers.AutoBackbone.from_pretrained (which, by the way, is missing a docstring in version 4.39.1) so that we can create backbones via

transformers.AutoBackbone.from_pretrained(
    backbone_name_or_path,
    key_mapping={
        "layernorm.weight": "hidden_states_norms.stage4.weight",
        "layernorm.bias": "hidden_states_norms.stage4.bias",
    },
)

Or we could hard-code this in the transformers library by comparing the two versions in the config class and applying the corresponding key mapping.
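Until such an argument exists, the remapping can be emulated by hand. A rough, untested sketch (the checkpoint path is a placeholder; strict=False tolerates the classifier keys the backbone does not have):

import torch
from transformers import ConvNextV2Backbone

key_mapping = {
    "layernorm.weight": "hidden_states_norms.stage4.weight",
    "layernorm.bias": "hidden_states_norms.stage4.bias",
}

backbone = ConvNextV2Backbone.from_pretrained("facebook/convnextv2-nano-22k-384")
state_dict = torch.load("pytorch_model.bin", map_location="cpu")  # cached checkpoint file

remapped = {}
for key, value in state_dict.items():
    key = key.removeprefix("convnextv2.")  # backbone keys carry no prefix
    remapped[key_mapping.get(key, key)] = value

# strict=False skips checkpoint-only keys such as classifier.*
backbone.load_state_dict(remapped, strict=False)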

wenh06 avatar Apr 08 '24 13:04 wenh06

@amyeroberts

wenh06 avatar Apr 12 '24 09:04 wenh06

Hi,

Is there a reason not to leverage the AutoModelForImageClassification class? This will properly instantiate all the parameters from the checkpoint on the hub, such as facebook/convnextv2-base-1k-224, and allow you to fine-tune the pre-trained ConvNext model on a different labeled dataset.

If you use the AutoBackbone class, then layernorms are added by default after each of the stages that you'd like to get the feature maps from. These need to be learned for a downstream task. For instance, the UPerNet model leverages ConvNext as backbone for the task of semantic segmentation.

By default, layernorms are added only for the last stage. Hence, using this:

from transformers import ConvNextV2Backbone

model = ConvNextV2Backbone.from_pretrained("facebook/convnextv2-base-1k-224")

results in a corresponding warning, telling you that layernorms are added for the last stage and need to be learned from scratch:

Some weights of ConvNextV2Backbone were not initialized from the model checkpoint at facebook/convnextv2-base-1k-224 and are newly initialized: ['convnextv2.hidden_states_norms.stage4.bias', 'convnextv2.hidden_states_norms.stage4.weight']

If you instantiate the class with a different out_indices (for instance if you want to get the features from all 4 stages), then layernorms are added for each stage:

from transformers import ConvNextV2Backbone

model = ConvNextV2Backbone.from_pretrained("facebook/convnextv2-base-1k-224", out_indices=[0,1,2,3])

This is shown in the warning as well:

Some weights of ConvNextV2Backbone were not initialized from the model checkpoint at facebook/convnextv2-base-1k-224 and are newly initialized: ['convnextv2.hidden_states_norms.stage1.bias', 'convnextv2.hidden_states_norms.stage1.weight', 'convnextv2.hidden_states_norms.stage2.bias', 'convnextv2.hidden_states_norms.stage2.weight', 'convnextv2.hidden_states_norms.stage3.bias', 'convnextv2.hidden_states_norms.stage3.weight', 'convnextv2.hidden_states_norms.stem.bias', 'convnextv2.hidden_states_norms.stem.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

NielsRogge avatar Apr 12 '24 13:04 NielsRogge

Because I do not want the head of AutoModelForImageClassification. I have my own heads (there are several, sharing the same backbone), and I just want this backbone to serve as a feature extractor. I noticed that

AutoModelForImageClassification.from_pretrained("facebook/convnextv2-large-22k-384")

will give me a ConvNextV2Model rather than a ConvNextV2Backbone inside it. But the two are identical except for the name (and class) of their last layer. The former ends with

(layernorm): LayerNorm((1536,), eps=1e-12, elementwise_affine=True)

while the latter ends with

(hidden_states_norms): ModuleDict(
  (stage4): ConvNextV2LayerNorm()
)

But comparing their __dict__ attributes shows that this ConvNextV2LayerNorm is almost the same thing as LayerNorm:

[image: side-by-side comparison of the two modules' __dict__ attributes]

I just want to train my heads and keep this backbone frozen as a whole, instead of training this layer norm along with my heads. Can't two modules that are essentially the same have unified behavior?
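For the record, this is the kind of setup I am after: ConvNextV2Model as a frozen feature extractor. A minimal sketch with a dummy input:

import torch
from transformers import ConvNextV2Model

backbone = ConvNextV2Model.from_pretrained("facebook/convnextv2-large-22k-384")
backbone.requires_grad_(False)  # freeze everything, final layernorm included
backbone.eval()

pixel_values = torch.randn(1, 3, 384, 384)  # dummy batch for illustration
with torch.no_grad():
    outputs = backbone(pixel_values)
features = outputs.pooler_output  # shared input for the task-specific heads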

wenh06 avatar Apr 12 '24 15:04 wenh06

For example, if I do not specify out_indices, could the pretrained weights of the last layer norm be loaded into the backbone?
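A minimal, untested sketch of what I mean, relying on the two norm modules having the same weight/bias shapes, as shown above:

from transformers import ConvNextV2Backbone, ConvNextV2Model

model_id = "facebook/convnextv2-nano-22k-384"

# the plain model keeps the pretrained final `layernorm`
full_model = ConvNextV2Model.from_pretrained(model_id)
backbone = ConvNextV2Backbone.from_pretrained(model_id)

# copy its weights into the newly initialized stage-4 norm of the backbone
backbone.hidden_states_norms["stage4"].load_state_dict(
    full_model.layernorm.state_dict()
)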

wenh06 avatar Apr 12 '24 15:04 wenh06

Ok, we could add a boolean flag to ConvNextConfig which determines whether or not to add layernorms after the stages. Currently, layernorms are always added to each of the stages that you want to get the feature maps for (as seen here).

out_indices defaults to -1, which means you'll get the feature map of the final stage:

from transformers import ConvNextImageProcessor, ConvNextV2Backbone
from PIL import Image
import requests

model_id = "facebook/convnextv2-tiny-1k-224"

image_processor = ConvNextImageProcessor.from_pretrained(model_id)
model = ConvNextV2Backbone.from_pretrained(model_id)

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

pixel_values = image_processor(image, return_tensors="pt").pixel_values

# by default, the feature map of the final stage is returned 
outputs = model(pixel_values)
for i in outputs.feature_maps:
    print(i.shape)

# if you want to get custom feature maps, feel free to change `out_indices`:
model = ConvNextV2Backbone.from_pretrained(model_id, out_indices=[0,1,2,3])

outputs = model(pixel_values)
for i in outputs.feature_maps:
    print(i.shape)
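For illustration, usage could look something like this; the flag name add_layernorms is invented here and does not exist in transformers today:

from transformers import ConvNextV2Backbone

# hypothetical: skip the extra per-stage norms entirely, so nothing
# would be randomly initialized on top of the pretrained weights
model = ConvNextV2Backbone.from_pretrained(
    "facebook/convnextv2-tiny-1k-224",
    add_layernorms=False,  # invented flag, not part of transformers today
)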

NielsRogge avatar Apr 12 '24 18:04 NielsRogge

Thank you, that's very nice 👍🏼

wenh06 avatar Apr 13 '24 00:04 wenh06

@NielsRogge Was a flag ever added for ConvNext? Could you open a PR to address this?

amyeroberts avatar Jun 03 '24 12:06 amyeroberts

cc @NielsRogge

amyeroberts avatar Jun 28 '24 11:06 amyeroberts

A boolean flag has not been added yet. There hasn't been a request for it, so I'm not sure it's worth adding.

NielsRogge avatar Jul 01 '24 09:07 NielsRogge

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 26 '24 08:07 github-actions[bot]