Ensure that `OneFormerProcessor` places text `task_inputs` on the same device as other inputs
System Info
- `transformers` version: 5.0.0.dev0
- Platform: Windows-10-10.0.26200-SP0
- Python version: 3.11.9
- Huggingface_hub version: 1.2.1
- Safetensors version: 0.5.3
- Accelerate version: not installed
- Accelerate config: not found
- DeepSpeed version: not installed
- PyTorch version (accelerator?): 2.9.1+cu128 (CUDA)
- Using distributed or parallel set-up in script?:
- Using GPU in script?:
- GPU type: NVIDIA GeForce RTX 2060
Who can help?
@yonigozlan @molbap
Information
- [ ] The official example scripts
- [x] My own modified scripts
Tasks
- [x] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Fast image processors can keep the output tensors on the same device as the input tensors.
This works for most processors, but `OneFormerProcessor`, which combines `OneFormerImageProcessorFast` with a tokenizer and takes both image and text as inputs, does not ensure that both outputs end up on the same device.
Let's create a simple script that loads an image, moves it to CUDA, and then tries to preprocess it with `OneFormerProcessor`:
```python
import requests
import transformers
from PIL import Image
from torchvision import transforms

to_tensor_transform = transforms.ToTensor()

processor = transformers.OneFormerProcessor(
    image_processor=transformers.OneFormerImageProcessorFast(),
    tokenizer=transformers.AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32"),
)

url = "https://huggingface.co/datasets/hf-internal-testing/fixtures_ade20k/resolve/main/ADE_val_00000001.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Convert the image to a tensor and move it to CUDA
image = to_tensor_transform(image).to("cuda")

# Semantic segmentation
inputs = processor(image, ["semantic"], return_tensors="pt")
```
When we check the inputs generated by the processor, we see that `pixel_values` is on CUDA, but `task_inputs` is not.
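A quick way to confirm the mismatch (the exact device index may differ on your machine):

```python
print(inputs["pixel_values"].device)  # cuda:0 -- produced by the fast image processor on the GPU
print(inputs["task_inputs"].device)   # cpu    -- tokenized text stays on the CPU
```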
Expected behavior
Probably `OneFormerProcessor` should not only tokenize the text inputs, but also move them to the same device as the images.
cc @molbap @yonigozlan @zucchini-nlp - this does sound like a real bug to me!
Hi, I modified the `__call__` method in `src/transformers/models/oneformer/processing_oneformer.py` and added a small snippet that checks the device of the `pixel_values` output and then manually moves `task_inputs` (and `text_inputs`) to that device.
if "pixel_values" in encoded_inputs:
pixel_values = encoded_inputs["pixel_values"]
device = None
if isinstance(pixel_values, torch.Tensor):
device = pixel_values.device
elif isinstance(pixel_values, (list, tuple)) and len(pixel_values) > 0 and isinstance(pixel_values[0], torch.Tensor):
device = pixel_values[0].device
if device is not None:
if "task_inputs" in encoded_inputs and isinstance(encoded_inputs["task_inputs"], torch.Tensor):
encoded_inputs["task_inputs"] = encoded_inputs["task_inputs"].to(device)
if "text_inputs" in encoded_inputs and isinstance(encoded_inputs["text_inputs"], torch.Tensor):
encoded_inputs["text_inputs"] = encoded_inputs["text_inputs"].to(device)
Is this the expected fix? If it is, I can submit a PR for it.
Though it does add a little latency (~0.1 ms), which is probably expected for moving data from CPU to GPU.
Looks more like a feature request to me. We currently do not accept a `device` argument in tokenizers, audio processors, and slow image processors, so when a user passes `device="cuda"` to the processor call, the inputs end up on different devices depending on the modality.
In general, I agree that this is not intuitive and that returning all tensors on the same device is the best option. Though we might want to apply it to text models as well, so that `tokenizer("hello world", return_tensors="pt", device="cuda")` works. I need your opinion on it @Rocketknight1 @molbap @yonigozlan
As for the above example, imo it's not worth trying to move all inputs to the same device as the passed pixels. That would mean changing each processor's `__call__` method, it doesn't generalize well to many modalities, and it feels like hidden internal magic, which I'd prefer to avoid.
Ah, good point - I forgot that image processors might often put outputs on cuda because they're using torchvision, but tokenizers won't. I think either is fine, but it's definitely confusing for users to get a mix of different devices in the output.
Not sure what the right solution here is! Maybe in processors that use torchvision, we always move outputs to the same device that the vision tensor is on?
> In general, I agree that this is not intuitive and that returning all tensors on the same device is the best option. Though we might want to apply it to text models as well, so that `tokenizer("hello world", return_tensors="pt", device="cuda")` works. I need your opinion on it @Rocketknight1 @molbap @yonigozlan
tokenizer("hello world", return_tensors="pt", device="cuda") would then be a bit redundant with tokenizer("hello world", return_tensors="pt").to("cuda")? It's not equivalent for fast image processor as the processing itself is done on cuda when device="cuda" is passed. I think it might be a bit misleading to support "device" for tokenizers or audio processor call when the processing won't be done on the given device.
Yeah, the issue is that these will not be equivalent. For tokenizers, I think there won't be any CUDA processing at any point in the future. Maybe we can then either give it a better name or state explicitly in the docs what `device` does. Currently I would expect the following code to return all inputs on `"cuda"`:

```python
inputs = processor(text, images, return_tensors="pt", device="cuda")
```
tokenizer("hello world", return_tensors="pt", device="cuda") would then be a bit redundant with tokenizer("hello world", return_tensors="pt").to("cuda")?
IMO we can make it analogous to how tensors can be initialized: either init on the device or move to the device afterwards.
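For reference, the analogy in plain PyTorch (both lines leave the tensor on the GPU):

```python
import torch

x = torch.zeros(3, device="cuda")  # created directly on the device
y = torch.zeros(3).to("cuda")      # created on CPU, then moved afterwards
```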
> Currently I would expect the following code to return all inputs on `"cuda"`:
> `inputs = processor(text, images, return_tensors="pt", device="cuda")`
Good point, I'm not against this but we should make it clear in the docstrings that processing is done on the given device only when supported.
cc @Rocketknight1 @simonreise @yonigozlan @zucchini-nlp Hi, can my modifications solve this problem? If so, could you help me merge the code?
For completeness, I agree that the benefit of making each processor responsible for uniform device placement is quite limited, and the costs are high.
If we add it, it should be in `ProcessorMixin` and indeed well-documented, just as a helper to move everything to the user-provided device. It should be explicit that it's just a simple "move" and not anything fancy like what the fast image processors do.
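A minimal sketch of such a helper, assuming it would live on `ProcessorMixin` (the name `_move_outputs_to_device` is made up for illustration, not the actual transformers API); it only relocates tensors that already exist and runs no processing on the target device:

```python
import torch

def _move_outputs_to_device(encoded_inputs, device):
    """Plain move: relocate every torch.Tensor in a processor output to `device`."""
    for key, value in encoded_inputs.items():
        if isinstance(value, torch.Tensor):
            encoded_inputs[key] = value.to(device)
        elif isinstance(value, (list, tuple)) and all(isinstance(v, torch.Tensor) for v in value):
            # Some image processors return lists of tensors (e.g. variable-sized masks)
            encoded_inputs[key] = [v.to(device) for v in value]
    return encoded_inputs
```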
@molbap Thank you for the great suggestion! You're absolutely right that moving the device consistency logic to ProcessorMixin is a much better architectural choice. I'd be happy to refactor this PR to implement it as a general solution in the base class with proper documentation. This way we can solve the OneFormer issue while benefiting all other processors too. Thanks for the guidance!
Yeah, we can add it in `BatchFeature.__init__`, use it when converting inputs to tensors, and document it as well. We would also need to make `device` a common kwarg along with `return_tensors`.
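Roughly, the shape of that idea (a simplified sketch with assumed signatures, not the merged implementation): `BatchFeature` accepts an optional `device` and moves tensors there right after conversion.

```python
import torch
from collections import UserDict

class BatchFeature(UserDict):
    """Simplified sketch; the real BatchFeature also handles numpy/tf/jax tensor types."""

    def __init__(self, data=None, tensor_type=None, device=None):
        super().__init__(data if data is not None else {})
        self.convert_to_tensors(tensor_type=tensor_type, device=device)

    def convert_to_tensors(self, tensor_type=None, device=None):
        if tensor_type != "pt":
            return self
        for key, value in self.items():
            if not isinstance(value, torch.Tensor):
                value = torch.tensor(value)
            if device is not None:
                # A plain move of the already-built tensors; no processing runs on `device`
                value = value.to(device)
            self[key] = value
        return self
```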
So I've added the `device` parameter to `BatchFeature.__init__`, added the device-moving logic in `BatchFeature.convert_to_tensors`, and then passed the `device` parameter in `OneFormerProcessor.__call__`.
A lot of models override the `__call__` of `ProcessorMixin`, so I think changing that alone wouldn't fix the issue.
This then returns all tensors on CUDA:

```python
inputs = processor(image, ["semantic"], return_tensors="pt", device="cuda")
```

If this looks good, I can submit a PR.