
torch.set_default_device("cuda") error

r-carpentier opened this issue 8 months ago · 0 comments

🐛 Bug

Setting the default device to anything other than `"cpu"` with `torch.set_default_device` leads to a `ValueError` when calling `model.predict`.

To Reproduce

I'm using the code sample provided on the Hugging Face page of the wmt22-cometkiwi-da model:

import torch
from comet import download_model, load_from_checkpoint

torch.set_default_device("cuda")

model_path = download_model("Unbabel/wmt22-cometkiwi-da")
model = load_from_checkpoint(model_path)
data = [
    {
        "src": "The output signal provides constant sync so the display never glitches.",
        "mt": "Das Ausgangssignal bietet eine konstante Synchronisation, so dass die Anzeige nie stört."
    },
    {
        "src": "Kroužek ilustrace je určen všem milovníkům umění ve věku od 10 do 15 let.",
        "mt": "Кільце ілюстрації призначене для всіх любителів мистецтва у віці від 10 до 15 років."
    },
    {
        "src": "Mandela then became South Africa's first black president after his African National Congress party won the 1994 election.",
        "mt": "その後、1994年の選挙でアフリカ国民会議派が勝利し、南アフリカ初の黒人大統領となった。"
    }
]
model_output = model.predict(data, batch_size=8, gpus=1)
print(model_output)

The output is:

Lightning automatically upgraded your loaded checkpoint from v1.8.2 to v2.5.1.post0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../../../../scratch1/robin/mlcache/huggingface/hub/models--Unbabel--wmt22-cometkiwi-da/snapshots/1ad785194e391eebc6c53e2d0776cada8f83179a/checkpoints/model.ckpt`
Encoder model frozen.
/home/test/miniconda3/envs/mware/lib/python3.13/site-packages/pytorch_lightning/core/saving.py:195: Found keys that are not in the model state dict but in the checkpoint: ['encoder.model.embeddings.position_ids']
Using default `ModelCheckpoint`. Consider installing `litmodels` package to enable `LitModelCheckpoint` for automatic upload to the Lightning model registry.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
You are using a CUDA device ('NVIDIA RTX A5000') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
Predicting: 0it [00:00, ?it/s]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 22
      7 model = load_from_checkpoint(model_path)
      8 data = [
      9     {
     10         "src": "The output signal provides constant sync so the display never glitches.",
   (...)     20     }
     21 ]
---> 22 model_output = model.predict(data, batch_size=8, gpus=1)
     23 print (model_output)

File ~/miniconda3/envs/mware/lib/python3.13/site-packages/comet/models/base.py:655, in CometModel.predict(self, samples, batch_size, gpus, devices, mc_dropout, progress_bar, accelerator, num_workers, length_batching)
    646 trainer = ptl.Trainer(
    647     devices=devices,
    648     logger=False,
   (...)    652     enable_progress_bar=enable_progress_bar,
    653 )
    654 return_predictions = False if gpus > 1 else True
--> 655 predictions = trainer.predict(
    656     self, dataloaders=dataloader, return_predictions=return_predictions
    657 )
    658 if gpus > 1:
    659     torch.distributed.barrier()  # Waits for all processes to finish predict
...
    raise ValueError(
    ...<4 lines>...
    ) from e
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).

Expected behaviour

I would expect no error, given that the model is running on CUDA anyway.
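
For what it's worth, here is a minimal sketch of a workaround, assuming the failure comes from tensor creation under the non-CPU default device rather than from the inputs themselves: restore the CPU default before calling predict, so the tokenizer and DataLoader can build CPU batches while gpus=1 still runs inference on the GPU.

import torch
from comet import download_model, load_from_checkpoint

torch.set_default_device("cuda")  # keep this for loading other models

model_path = download_model("Unbabel/wmt22-cometkiwi-da")
model = load_from_checkpoint(model_path)

data = [{"src": "...", "mt": "..."}]  # same format as the reproduction above

# Assumed workaround: restore the CPU default so the tokenizer/DataLoader can
# create CPU tensors; predict(gpus=1) still moves the model to the GPU.
torch.set_default_device("cpu")
model_output = model.predict(data, batch_size=8, gpus=1)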

Environment

OS: Ubuntu 22.04
Packaging: conda environment, packages installed with pip
Versions: unbabel-comet 2.2.6; torch 2.7.0

Additional context

I use `torch.set_default_device("cuda")` to ensure that other models are always loaded on the GPU.
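
If the only goal is to make sure specific models end up on the GPU, a narrower alternative (a sketch, not COMET-specific) is PyTorch's device context manager, which scopes the default device to a with block instead of changing it process-wide:

import torch

# Scope the default device to this block only, so code outside it (e.g.
# COMET's tokenizer and DataLoader) keeps the CPU default.
with torch.device("cuda"):
    other_model = torch.nn.Linear(4, 4)  # hypothetical stand-in model

print(other_model.weight.device)  # cuda:0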

r-carpentier · May 09 '25