Ensure that `OneFormerProcessor` places text `task_inputs` on the same device as other inputs
System Info
- `transformers` version: 5.0.0.dev0
- Platform: Windows-10-10.0.26200-SP0
- Python version: 3.11.9
- Huggingface_hub version: 1.2.1
- Safetensors version: 0.5.3
- Accelerate version: not installed
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.9.1+cu128 (CUDA)
- Using distributed or parallel set-up in script?:
- Using GPU in script?:
- GPU type: NVIDIA GeForce RTX 2060
Who can help?
@yonigozlan @molbap
Information
- [ ] The official example scripts
- [x] My own modified scripts
Tasks
- [x] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Fast image processors can keep the output tensors on the same device as the input tensors.
This works for most processors, but `OneFormerProcessor`, which combines `OneFormerImageProcessorFast` with a tokenizer and takes both image and text as inputs, does not ensure that both outputs end up on the same device.
Let's create a simple script that loads an image, moves it to CUDA, and then tries to preprocess it with `OneFormerProcessor`:
```python
import requests
import transformers
from PIL import Image
from torchvision import transforms

to_tensor_transform = transforms.ToTensor()

processor = transformers.OneFormerProcessor(
    image_processor=transformers.OneFormerImageProcessorFast(),
    tokenizer=transformers.AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32"),
)

url = "https://huggingface.co/datasets/hf-internal-testing/fixtures_ade20k/resolve/main/ADE_val_00000001.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Convert the image to a tensor and move it to CUDA
image = to_tensor_transform(image).to("cuda")

# Semantic segmentation
inputs = processor(image, ["semantic"], return_tensors="pt")
```
When we check the inputs generated by the processor, we see that `pixel_values` is on CUDA, but `task_inputs` is not.
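A quick way to confirm the mismatch (the exact device index may differ on your machine):

```python
print(inputs["pixel_values"].device)  # cuda:0 -- produced by the fast image processor on the GPU
print(inputs["task_inputs"].device)   # cpu    -- tokenized text stays on the CPU
```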
Expected behavior
Probably `OneFormerProcessor` should not only tokenize the text inputs, but also move them to the same device as the images.
cc @molbap @yonigozlan @zucchini-nlp - this does sound like a real bug to me!
Hi, I modified the `__call__` method in `src/transformers/models/oneformer/processing_oneformer.py` and added a small snippet that checks the device of the `pixel_values` output and then manually moves `task_inputs` (and `text_inputs`) to that device.
if "pixel_values" in encoded_inputs:
pixel_values = encoded_inputs["pixel_values"]
device = None
if isinstance(pixel_values, torch.Tensor):
device = pixel_values.device
elif isinstance(pixel_values, (list, tuple)) and len(pixel_values) > 0 and isinstance(pixel_values[0], torch.Tensor):
device = pixel_values[0].device
if device is not None:
if "task_inputs" in encoded_inputs and isinstance(encoded_inputs["task_inputs"], torch.Tensor):
encoded_inputs["task_inputs"] = encoded_inputs["task_inputs"].to(device)
if "text_inputs" in encoded_inputs and isinstance(encoded_inputs["text_inputs"], torch.Tensor):
encoded_inputs["text_inputs"] = encoded_inputs["text_inputs"].to(device)
Is this the expected fix? If it is, I can submit a PR for it.
Though it does add a little latency (~0.1 ms), which is probably expected for moving data from CPU to GPU.
Looks more like a feature request to me. We currently do not accept a `device` argument in tokenizers, audio processors, and slow image processors, so when a user passes `device="cuda"` to the processor call, the inputs end up on different devices depending on the modality.
In general, I agree that this is not intuitive and that returning all tensors on the same device is the best option. Though we might want to apply it to text models as well, so that `tokenizer("hello world", return_tensors="pt", device="cuda")` works. I need your opinion on it @Rocketknight1 @molbap @yonigozlan
As for the above example, imo it's not worth trying to move all inputs to the same device as the passed pixels. That would mean changing each processor's `__call__` method, it doesn't generalize well to many modalities, and it feels like hidden internal magic, which I'd prefer to avoid.
Ah, good point - I forgot that image processors might often put outputs on cuda because they're using torchvision, but tokenizers won't. I think either is fine, but it's definitely confusing for users to get a mix of different devices in the output.
Not sure what the right solution here is! Maybe in processors that use torchvision, we always move outputs to the same device that the vision tensor is on?
> In general, I agree that this is not intuitive and that returning all tensors on the same device is the best option. Though we might want to apply it to text models as well, so that `tokenizer("hello world", return_tensors="pt", device="cuda")` works. I need your opinion on it @Rocketknight1 @molbap @yonigozlan
tokenizer("hello world", return_tensors="pt", device="cuda") would then be a bit redundant with tokenizer("hello world", return_tensors="pt").to("cuda")? It's not equivalent for fast image processor as the processing itself is done on cuda when device="cuda" is passed. I think it might be a bit misleading to support "device" for tokenizers or audio processor call when the processing won't be done on the given device.
Yeah, the issue is that these will not be equivalent. For tokenizers, I think there won't be any CUDA processing at any point in the future. Maybe we can then either give it a better name or state explicitly in the docs what `device` does. Currently I would expect the following code to return all inputs on `"cuda"`:

```python
inputs = processor(text, images, return_tensors="pt", device="cuda")
```
tokenizer("hello world", return_tensors="pt", device="cuda") would then be a bit redundant with tokenizer("hello world", return_tensors="pt").to("cuda")?
IMO we can make it analogous to how tensors can be initialized: either init on the device or move to the device afterwards.
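For reference, the analogy in plain PyTorch (both lines leave the tensor on the GPU):

```python
import torch

x = torch.zeros(3, device="cuda")  # created directly on the device
y = torch.zeros(3).to("cuda")      # created on CPU, then moved afterwards
```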
> Currently I would expect the following code to return all inputs on `"cuda"`:
> `inputs = processor(text, images, return_tensors="pt", device="cuda")`
Good point, I'm not against this but we should make it clear in the docstrings that processing is done on the given device only when supported.
cc @Rocketknight1 @simonreise @yonigozlan @zucchini-nlp Hi, can my modifications solve this problem? If so, could you help me merge the code?
For completeness, I agree that the benefit of making each processor responsible for uniform device placement is quite limited, and the costs are high.
If we add it, it should be in `ProcessorMixin` and indeed well-documented, just as a helper to move everything to the user-provided device. It should be explicit that it's just a simple "move" and not anything fancy like what the fast image processors do.
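A minimal sketch of such a helper, assuming it would live on `ProcessorMixin` (the name `_move_outputs_to_device` is made up for illustration, not the actual transformers API); it only relocates tensors that already exist and runs no processing on the target device:

```python
import torch

def _move_outputs_to_device(encoded_inputs, device):
    """Plain move: relocate every torch.Tensor in a processor output to `device`."""
    for key, value in encoded_inputs.items():
        if isinstance(value, torch.Tensor):
            encoded_inputs[key] = value.to(device)
        elif isinstance(value, (list, tuple)) and all(isinstance(v, torch.Tensor) for v in value):
            # Some image processors return lists of tensors (e.g. variable-sized masks)
            encoded_inputs[key] = [v.to(device) for v in value]
    return encoded_inputs
```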
@molbap Thank you for the great suggestion! You're absolutely right that moving the device consistency logic to ProcessorMixin is a much better architectural choice. I'd be happy to refactor this PR to implement it as a general solution in the base class with proper documentation. This way we can solve the OneFormer issue while benefiting all other processors too. Thanks for the guidance!
Yeah, we can add it in `BatchFeature.__init__`, use it when converting inputs to tensors, and document it as well. We would also need to make `device` a common kwarg along with `return_tensors`.
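Roughly, the shape of that idea (a simplified sketch with assumed signatures, not the merged implementation): `BatchFeature` accepts an optional `device` and moves tensors there right after conversion.

```python
import torch
from collections import UserDict

class BatchFeature(UserDict):
    """Simplified sketch; the real BatchFeature also handles numpy/tf/jax tensor types."""

    def __init__(self, data=None, tensor_type=None, device=None):
        super().__init__(data if data is not None else {})
        self.convert_to_tensors(tensor_type=tensor_type, device=device)

    def convert_to_tensors(self, tensor_type=None, device=None):
        if tensor_type != "pt":
            return self
        for key, value in self.items():
            if not isinstance(value, torch.Tensor):
                value = torch.tensor(value)
            if device is not None:
                # A plain move of the already-built tensors; no processing runs on `device`
                value = value.to(device)
            self[key] = value
        return self
```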
So I've added the `device` parameter to `BatchFeature.__init__`, added the device-moving logic in `BatchFeature.convert_to_tensors`, and then passed the `device` parameter in `OneFormerProcessor.__call__`.
A lot of models override the `__call__` of `ProcessorMixin`, so I think changing that alone wouldn't fix the issue.
This then returns all tensors on CUDA:

```python
inputs = processor(image, ["semantic"], return_tensors="pt", device="cuda")
```

If this looks good, I can submit a PR.