
Incorrect keypoint batch handling inside SuperGlueForKeypointMatching

Open i44p opened this issue 7 months ago • 4 comments

System Info

  • transformers version: 4.51.3
  • Platform: Linux-6.14.6-arch1-1-x86_64-with-glibc2.41
  • Python version: 3.12.10
  • Huggingface_hub version: 0.30.2
  • Safetensors version: 0.5.3
  • Accelerate version: not installed
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (GPU?): 2.6.0 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?: no

Who can help?

@qubvel @sbucaille

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

  1. Install pytorch, pillow, requests, and transformers==4.51.3 using either pip or pixi.
  2. Run the following script:
import torch
from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import requests

class Test:
    def __init__(self):
        self.processor = AutoImageProcessor.from_pretrained("magic-leap-community/superglue_outdoor")
        self.model = AutoModel.from_pretrained("magic-leap-community/superglue_outdoor")
    
    @torch.inference_mode()
    def get_keypoints(
        self,
        series1: list[Image.Image],
        series2: list[Image.Image]
        ):

        images = []
        for s1, s2 in zip(series1, series2):
            images.append([s1, s2])
        
        processor_inputs = self.processor(images, return_tensors="pt")
        outputs = self.model(**processor_inputs)

        image_sizes = [[(s1.height, s1.width), (s2.height, s2.width)] 
                for s1, s2 in zip(series1, series2)]
        
        processed_outputs = self.processor.post_process_keypoint_matching(
            outputs, image_sizes
        )
        return processed_outputs

url_image1 = "https://raw.githubusercontent.com/magicleap/SuperGluePretrainedNetwork/refs/heads/master/assets/phototourism_sample_images/united_states_capitol_98169888_3347710852.jpg"
image1 = Image.open(requests.get(url_image1, stream=True).raw)
url_image2 = "https://raw.githubusercontent.com/magicleap/SuperGluePretrainedNetwork/refs/heads/master/assets/phototourism_sample_images/united_states_capitol_26757027_6717084061.jpg"
image2 = Image.open(requests.get(url_image2, stream=True).raw)

test = Test()
kps = test.get_keypoints((image1, image1), (image2, image2))

assert torch.equal(kps[0]['keypoints0'], kps[1]['keypoints0'])
print("Assertion succeeded!")

Expected behavior

The script executes successfully and get_keypoints returns two identical arrays, so the assertion succeeds.

I tried to use SuperGlueForKeypointMatching (added in #29886) for batch inference, but I found that while it works well with a single image pair, it fails on batched inference. I believe this is caused by an incorrect concatenation inside SuperGlueForKeypointMatching._match_image_pair: https://github.com/huggingface/transformers/blob/d0c9c66d1c09df3cd70bf036e813d88337b20d4c/src/transformers/models/superglue/modeling_superglue.py#L726-L727

Changing these lines to the following seemingly fixed the issue for me:

        matches = torch.cat([matches0, matches1], dim=1).reshape(batch_size, 2, -1)
        matching_scores = torch.cat([matching_scores0, matching_scores1], dim=1).reshape(batch_size, 2, -1)
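
For intuition, here is a minimal toy sketch (plain tensors, not the actual SuperGlue internals) of why concatenating along dim=0 (the default) and then reshaping mixes rows from different image pairs, while the dim=1 variant keeps both directions of the same pair together:

import torch

batch_size, num_keypoints = 3, 4
# toy per-pair match tensors: one row per image pair
matches0 = torch.arange(batch_size * num_keypoints).reshape(batch_size, num_keypoints)        # image0 -> image1
matches1 = torch.arange(batch_size * num_keypoints).reshape(batch_size, num_keypoints) + 100  # image1 -> image0

# current behaviour: rows from two different pairs end up grouped together
wrong = torch.cat([matches0, matches1], dim=0).reshape(batch_size, 2, -1)
print(wrong[0])  # contains matches0[0] and matches0[1], i.e. two different pairs

# suggested fix: concatenate per pair along dim=1, then split into the two directions
right = torch.cat([matches0, matches1], dim=1).reshape(batch_size, 2, -1)
print(right[0])  # contains matches0[0] and matches1[0], i.e. both directions of pair 0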

i44p avatar May 25 '25 05:05 i44p

Why will this often fail?

In keypoint detection or matching tasks, especially for two consecutive images or different samples, the detected keypoints are almost never identical.
Even for the same image processed twice, minor differences (e.g., from random noise, augmentation, or algorithmic variance) can make the keypoint coordinates different.

When would this assertion succeed?

Only when kps[0]['keypoints0'] and kps[1]['keypoints0'] are exactly the same tensor (identical values, shape, dtype).
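
For reference, a tiny sketch of what torch.equal actually checks (toy tensors, not SuperGlue outputs):

import torch

a = torch.tensor([[1.0, 2.0]])
b = a.clone()
print(torch.equal(a, b))            # True: same shape, dtype, and values
print(torch.equal(a, a + 1e-6))     # False: any elementwise difference fails
print(torch.allclose(a, a + 1e-6))  # True: tolerance-based comparison instead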

gspeter-max avatar Jun 15 '25 08:06 gspeter-max

keypoints0: coordinates of keypoints in image 0 (usually shape: [N, 2], where N is the number of keypoints, and each entry is [x, y])
keypoints1: coordinates of keypoints in image 1 (shape: [M, 2], for M keypoints)

keypoint0    keypoint1    score
[644, 20]    [712, 179]   0.9726
[650, 65]    [715, 215]   0.7948
[638, 66]    [707, 213]   0.8859

The score is the matching confidence for each keypoint pair (higher means a more reliable match).
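
A minimal sketch (using the output dictionaries returned by post_process_keypoint_matching, with the variable name from the script above) of printing each match and its score in that tabular form:

output = processed_outputs[0]  # results for the first image pair
for kp0, kp1, score in zip(output["keypoints0"], output["keypoints1"], output["matching_scores"]):
    print(f"{kp0.tolist()} {kp1.tolist()} {score.item():.4f}")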

gspeter-max avatar Jun 15 '25 08:06 gspeter-max

@gspeter-max are these comments AI-generated? This always fails, not just often, and I can't make any sense of your comments; they look hallucinated. I probably should have clarified that I don't expect kps[0]['keypoints0'] and kps[1]['keypoints0'] to be exactly identical, but as I said in the OP, there's clearly something wrong with the way matches and matching scores are concatenated inside transformers, which is likely fixed by the changes I suggested.

If I try to run this code as is, without modifying transformers/src/transformers/models/superglue/modeling_superglue.py:

import torch
from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import matplotlib.pyplot as plt
import requests


class Test:
    def __init__(self):
        self.processor = AutoImageProcessor.from_pretrained("magic-leap-community/superglue_outdoor")
        self.model = AutoModel.from_pretrained("magic-leap-community/superglue_outdoor")
    
    @torch.inference_mode()
    def get_keypoints(
        self,
        series1: list[Image.Image],
        series2: list[Image.Image]
        ):

        images = []
        for s1, s2 in zip(series1, series2):
            images.append([s1, s2])
        
        processor_inputs = self.processor(images, return_tensors="pt")
        outputs = self.model(**processor_inputs)

        image_sizes = [[(s1.height, s1.width), (s2.height, s2.width)] 
                for s1, s2 in zip(series1, series2)]
        
        processed_outputs = self.processor.post_process_keypoint_matching(
            outputs, image_sizes
        )
        return processed_outputs


urls = [
    [
        "https://raw.githubusercontent.com/magicleap/SuperGluePretrainedNetwork/refs/heads/master/assets/phototourism_sample_images/united_states_capitol_98169888_3347710852.jpg",
        "https://raw.githubusercontent.com/magicleap/SuperGluePretrainedNetwork/refs/heads/master/assets/phototourism_sample_images/united_states_capitol_26757027_6717084061.jpg"
    ],
    [
        "https://raw.githubusercontent.com/magicleap/SuperGluePretrainedNetwork/refs/heads/master/assets/phototourism_sample_images/piazza_san_marco_06795901_3725050516.jpg",
        "https://raw.githubusercontent.com/magicleap/SuperGluePretrainedNetwork/refs/heads/master/assets/phototourism_sample_images/piazza_san_marco_15148634_5228701572.jpg"
    ],
    [
        "https://raw.githubusercontent.com/magicleap/SuperGluePretrainedNetwork/refs/heads/master/assets/phototourism_sample_images/st_pauls_cathedral_30776973_2635313996.jpg",
        "https://raw.githubusercontent.com/magicleap/SuperGluePretrainedNetwork/refs/heads/master/assets/phototourism_sample_images/st_pauls_cathedral_37347628_10902811376.jpg"
    ]
]

pairs = []
for url_pair in urls:
    pair = []
    for url in url_pair:
        im = Image.open(requests.get(url, stream=True).raw)
        pair.append(im)
    pairs.append(pair)

series1 = []
series2 = []

for im1, im2 in pairs:
    series1.append(im1)
    series2.append(im2)

# Inference
test = Test()

kps_batched = test.get_keypoints(series1, series2)

kps_single = []
for im1, im2 in zip(series1, series2):
    kps = test.get_keypoints((im1,), (im2,))[0]
    kps_single.append(kps)


print("\nNon-batched:")
for i, kps in enumerate(kps_single):
    print(f"kps[{i}]: keypoints0={kps['keypoints0'].shape}, keypoints1={kps['keypoints1'].shape}, matching_scores.mean={kps['matching_scores'].mean()}")

print("\nBatched:")
for i, kps in enumerate(kps_batched):
    print(f"kps[{i}]: keypoints0={kps['keypoints0'].shape}, keypoints1={kps['keypoints1'].shape}, matching_scores.mean={kps['matching_scores'].mean()}")

I will get this error:

Traceback (most recent call last):
  File "<>/test.py", line 69, in <module>
    kps_batched = test.get_keypoints(series1, series2)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<>/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "<>/test.py", line 30, in get_keypoints
    processed_outputs = self.processor.post_process_keypoint_matching(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<>/python3.12/site-packages/transformers/models/superglue/image_processing_superglue.py", line 393, in post_process_keypoint_matching
    matched_keypoints1 = keypoints1[matches0[valid_matches]]
                         ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: index 1418 is out of bounds for dimension 0 with size 1371
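
For intuition about this IndexError (a toy sketch, not the actual transformers internals): when matches belonging to one pair get grouped with the keypoints of another pair, a match index can exceed the number of keypoints detected for that other pair:

import torch

keypoints = torch.randn(1371, 2)    # keypoints detected for this pair (1371 of them)
match_index = torch.tensor([1418])  # a match index that belongs to a different pair
try:
    keypoints[match_index]
except IndexError as e:
    print(e)  # index 1418 is out of bounds for dimension 0 with size 1371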

If, however, I apply the changes I described, the code runs successfully:

Non-batched:
kps[0]: keypoints0=torch.Size([233, 2]), keypoints1=torch.Size([233, 2]), matching_scores.mean=0.4552536904811859
kps[1]: keypoints0=torch.Size([399, 2]), keypoints1=torch.Size([399, 2]), matching_scores.mean=0.4019520580768585
kps[2]: keypoints0=torch.Size([256, 2]), keypoints1=torch.Size([256, 2]), matching_scores.mean=0.3144405484199524

Batched:
kps[0]: keypoints0=torch.Size([234, 2]), keypoints1=torch.Size([234, 2]), matching_scores.mean=0.4524313807487488
kps[1]: keypoints0=torch.Size([399, 2]), keypoints1=torch.Size([399, 2]), matching_scores.mean=0.40192607045173645
kps[2]: keypoints0=torch.Size([256, 2]), keypoints1=torch.Size([256, 2]), matching_scores.mean=0.3144405484199524
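
As a rough sanity check (a sketch on top of the script above; the tolerance is arbitrary), the batched and non-batched results can also be compared programmatically once the fix is applied:

for i, (b, s) in enumerate(zip(kps_batched, kps_single)):
    same_count = b["keypoints0"].shape[0] == s["keypoints0"].shape[0]
    close_mean = torch.isclose(
        b["matching_scores"].mean(), s["matching_scores"].mean(), atol=1e-2
    )
    print(f"pair {i}: match count equal={same_count}, score means close={bool(close_mean)}")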

i44p avatar Jun 15 '25 12:06 i44p

AssertionError                            Traceback (most recent call last)

in <cell line: 0>()
     39 kps = test.get_keypoints((image1, image1), (image2, image2))
     40 
---> 41 assert torch.equal(kps[0]['keypoints0'], kps[1]['keypoints0'])
     42 print("Assertion succeeded!")

AssertionError:

Thanks for responding. When I tried to reproduce the issue, I got this assertion error as well, so I now see that you were demonstrating that the values are not equal, which is why you created this issue. I didn't read the issue carefully enough before commenting; sorry about that.

gspeter-max avatar Jun 15 '25 12:06 gspeter-max

cc @sbucaille if you have an idea why it may fail

qubvel avatar Jun 16 '25 18:06 qubvel

Hey! @i44p is totally right, these two lines concatenate the matches and the scores incorrectly. In the current implementation, having 3 pairs of images results in this concatenation: [im0-a, im0-b, im1-a] <matches> [im1-b, im2-a, im2-b] instead of [im0-a, im1-a, im2-a] <matches> [im0-b, im1-b, im2-b]. Thanks for catching it! Also, sorry I didn't see the OP notification; I hope it wasn't critical for your work.

@qubvel I opened https://github.com/huggingface/transformers/pull/38850 which fixes the issue (full credit to @i44p though 😅)

sbucaille avatar Jun 16 '25 20:06 sbucaille