
DPT implementation contains unused parameters

Open ducha-aiki opened this issue 1 year ago • 4 comments

System Info

Kind of irrelevant, but:

- `transformers` version: 4.40.0
- Platform: macOS-14.4.1-arm64-arm-64bit
- Python version: 3.9.16
- Huggingface_hub version: 0.20.3
- Safetensors version: 0.4.2
- Accelerate version: 0.22.0
- Accelerate config: 	not found
- PyTorch version (GPU?): 2.3.0 (False)
- Tensorflow version (GPU?): 2.13.0 (False)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: n/a
- Using distributed or parallel set-up in script?: yes

Who can help?

The first (zeroth) fusion layer is never used, which causes issues like the following when running with DDP:

Parameters which did not receive grad for rank 3: neck.fusion_stage.layers.0.residual_layer1.convolution2.bias, neck.fusion_stage.layers.0.residual_layer1.convolution2.weight, neck.fusion_stage.layers.0.residual_layer1.convolution1.bias, neck.fusion_stage.layers.0.residual_layer1.convolution1.weight

@amyeroberts

Information

  • [X] The official example scripts
  • [X] My own modified scripts

Reproduction

We take the code from the DPT doc page (https://huggingface.co/docs/transformers/main/en/model_doc/dpt), run a forward-backward pass, and check for unused parameters:


import torch
from transformers import Dinov2Config, DPTConfig, DPTForDepthEstimation

# initialize with a Transformer-based backbone such as DINOv2
# in that case, we also specify `reshape_hidden_states=False` to get feature maps of shape (batch_size, num_channels, height, width)
backbone_config = Dinov2Config.from_pretrained(
    "facebook/dinov2-base",
    out_features=["stage1", "stage2", "stage3", "stage4"],
    reshape_hidden_states=False,
)

config = DPTConfig(backbone_config=backbone_config)
model = DPTForDepthEstimation(config=config)

out = model(torch.rand(1, 3, 512, 512))
loss = out.predicted_depth.mean()
loss.backward()
for n, p in model.named_parameters():
    if p.grad is None:
        if "backbone" in n:
            continue  # part of the backbone is not used, and that is fine
        print(f"found unused param, {n}")

Result:

found unused param, neck.fusion_stage.layers.0.residual_layer1.convolution1.weight
found unused param, neck.fusion_stage.layers.0.residual_layer1.convolution1.bias
found unused param, neck.fusion_stage.layers.0.residual_layer1.convolution2.weight
found unused param, neck.fusion_stage.layers.0.residual_layer1.convolution2.bias

This prevents DDP training. To fix it, one should add the following line to the DPT model:

self.neck.fusion_stage.layers[0].residual_layer1 = None
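
Until such a fix lands in the library, the same workaround can be applied from user code right after constructing the model; a minimal sketch reusing the repro setup above:

model = DPTForDepthEstimation(config=config)
# the zeroth fusion layer's first residual unit is never called in forward,
# so dropping it removes the parameters DDP complains about
model.neck.fusion_stage.layers[0].residual_layer1 = None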

Here is the same fix in mmsegmentation:

https://github.com/open-mmlab/mmsegmentation/blob/main/mmseg/models/decode_heads/dpt_head.py#L271

self.fusion_blocks[0].res_conv_unit1 = None
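
On the transformers side, a sketch of what the equivalent change could look like in DPTFeatureFusionStage's constructor, assuming it builds its fusion layers into a ModuleList named layers (consistent with the parameter names in the error above; the exact constructor body here is an assumption, not the actual source):

from torch import nn

class DPTFeatureFusionStage(nn.Module):
    def __init__(self, config):
        super().__init__()
        # assumed structure: one fusion layer per neck feature map
        self.layers = nn.ModuleList(
            DPTFeatureFusionLayer(config) for _ in range(len(config.neck_hidden_sizes))
        )
        # the first fusion layer has no lower-resolution input, so its first
        # residual unit is never called; drop it so every remaining parameter
        # receives a gradient under DDP
        self.layers[0].residual_layer1 = None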

I can submit a PR with this fix.

Expected behavior

The model should have no unused parameters.

ducha-aiki avatar May 03 '24 10:05 ducha-aiki

Hi @ducha-aiki, thanks for reporting!

You are right, it looks like we can safely delete layers[0].residual_layer1 from DPTFeatureFusionStage because it's never used.

Would you mind sharing why this prevents DDP training?

qubvel avatar May 03 '24 14:05 qubvel

@qubvel I believe I shared this above:

Parameters which did not receive grad for rank 3: neck.fusion_stage.layers.0.residual_layer1.convolution2.bias, neck.fusion_stage.layers.0.residual_layer1.convolution2.weight, neck.fusion_stage.layers.0.residual_layer1.convolution1.bias, neck.fusion_stage.layers.0.residual_layer1.convolution1.weight

That is a quote from the crash message I get when running multi-GPU training with accelerate and specifying ddp_find_unused_parameters=False in the Trainer.
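
For completeness, the crash can also be sidestepped without touching the model by leaving that flag enabled, at the cost of an extra graph traversal every step; a sketch:

from transformers import TrainingArguments

# slower workaround: let DDP detect and skip parameters without gradients
args = TrainingArguments(output_dir="out", ddp_find_unused_parameters=True)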

ducha-aiki avatar May 03 '24 14:05 ducha-aiki

Thank you, I missed it 🙂 I am trying to understand why the backbone's unused weights are not blocking while the neck's are. Did you try training with the fix? Anyway, if this solves the issue it is worth a PR.

qubvel avatar May 03 '24 15:05 qubvel

@qubvel good point about the backbone. Probably because I train with a frozen backbone, which is fairly common. As for removing the backbone's unused params, that would probably require too many changes. I will do a PR then, thanks.

ducha-aiki avatar May 03 '24 15:05 ducha-aiki

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jul 23 '24 08:07 github-actions[bot]