ludwig CUDA deserialization fails on GPU machine

Describe the bug

I am encountering this error with the following commands:

ludwig train (with use_gpu: false)
ludwig experiment (any GPU settings)

RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

To Reproduce

model_type: ecd
input_features:
-
    name: quantity
    type: numerical
    encoder:
      type: dense
      dropout: 0.2
      num_layers: 1
      activation: relu
      output_size: 16
output_features:
-
    name: unit
    type: category
    calibration: true
    loss:
      type: softmax_cross_entropy
    top_k: 1

trainer:
    early_stop: 3
    epochs: 100
    batch_size: 32
    learning_rate: 0.001
    optimizer:
        type: adam

backend:
    type: ray
    trainer:
        use_gpu: false

Dataset (randomly generated): data.csv

Commands:

ludwig train \
  --dataset data.csv \
  --config config.yaml

ludwig experiment \
  --dataset data.csv \
  --config config.yaml

Expected behavior

I'd expect to be able to run ludwig train and ludwig experiment on the model without CUDA-related errors.

Environment (please complete the following information):

OS: CentOS
Version: 7
Python version: 3.10.13
Ludwig version: 0.8.6
Ray version: 2.3.1
GPU: the instance has an A10G

Additional context

When using ludwig train:

(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/ludwig/backend/ray.py", line 552, in <lambda>
(TorchTrainer pid=18223)   lambda config: tune_batch_size_fn(**config),
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/ludwig/backend/ray.py", line 266, in tune_batch_size_fn
(TorchTrainer pid=18223)   model = ray.get(model_ref)
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
(TorchTrainer pid=18223)   return func(*args, **kwargs)
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/worker.py", line 2382, in get
(TorchTrainer pid=18223)   raise value
(TorchTrainer pid=18223) ray.exceptions.RaySystemError: System error: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
(TorchTrainer pid=18223) traceback: Traceback (most recent call last):
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 369, in deserialize_objects
(TorchTrainer pid=18223)   obj = self._deserialize_object(data, metadata, object_ref)
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 252, in _deserialize_object
(TorchTrainer pid=18223)   return self._deserialize_msgpack_data(data, metadata_fields)
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 207, in _deserialize_msgpack_data
(TorchTrainer pid=18223)   python_objects = self._deserialize_pickle5_data(pickle5_data)
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 197, in _deserialize_pickle5_data
(TorchTrainer pid=18223)   obj = pickle.loads(in_band)
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/storage.py", line 337, in _load_from_bytes
(TorchTrainer pid=18223)   return torch.load(io.BytesIO(b))
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 1028, in load
(TorchTrainer pid=18223)   return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 1256, in _legacy_load
(TorchTrainer pid=18223)   result = unpickler.load()
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 1193, in persistent_load
(TorchTrainer pid=18223)   wrap_storage=restore_location(obj, location),
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 381, in default_restore_location
(TorchTrainer pid=18223)   result = fn(storage, location)
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 274, in _cuda_deserialize
(TorchTrainer pid=18223)   device = validate_cuda_device(location)
(TorchTrainer pid=18223)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 258, in validate_cuda_device
(TorchTrainer pid=18223)   raise RuntimeError('Attempting to deserialize object on a CUDA '
(TorchTrainer pid=18223) RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

When using ludwig experiment:

MapBatches(postprocess_batch):  0%|     | 0/1 [00:10<?, ?it/s]
(_map_task pid=4697) 2023-11-22 22:28:35,512	ERROR serialization.py:371 -- Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
(_map_task pid=4697) Traceback (most recent call last):
(_map_task pid=4697)  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 369, in deserialize_objects
(_map_task pid=4697)   obj = self._deserialize_object(data, metadata, object_ref)
(_map_task pid=4697)  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 252, in _deserialize_object
(_map_task pid=4697)   return self._deserialize_msgpack_data(data, metadata_fields)
(_map_task pid=4697)  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 207, in _deserialize_msgpack_data
(_map_task pid=4697)   python_objects = self._deserialize_pickle5_data(pickle5_data)
(_map_task pid=4697)  File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 197, in _deserialize_pickle5_data
(_map_task pid=4697)   obj = pickle.loads(in_band)
(_map_task pid=4697)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/storage.py", line 337, in _load_from_bytes
(_map_task pid=4697)   return torch.load(io.BytesIO(b))
(_map_task pid=4697)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 1028, in load
(_map_task pid=4697)   return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
(_map_task pid=4697)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 1256, in _legacy_load
(_map_task pid=4697)   result = unpickler.load()
(_map_task pid=4697)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 1193, in persistent_load
(_map_task pid=4697)   wrap_storage=restore_location(obj, location),
(_map_task pid=4697)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 381, in default_restore_location
(_map_task pid=4697)   result = fn(storage, location)
(_map_task pid=4697)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 274, in _cuda_deserialize
(_map_task pid=4697)   device = validate_cuda_device(location)
(_map_task pid=4697)  File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 258, in validate_cuda_device
(_map_task pid=4697)   raise RuntimeError('Attempting to deserialize object on a CUDA '
(_map_task pid=4697) RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

Nov 24 '23 19:11 philippe-solodov-wd

Hi @philippe-solodov-wd, can you describe your use case in more detail? Are you trying to train on a GPU-enabled machine without using the GPU?

Dec 12 '23 20:12 geoffreyangus

Hey @philippe-solodov-wd, it looks like the issue is that the Ludwig entrypoint initializes the model weights on the GPU, but the workers are unable to deserialize them because they don't have GPU visibility. The fix on our side would be to move the model to CPU before inserting it into the Ray object store when the workers don't have a GPU.
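To illustrate the idea behind that fix, here is a minimal sketch, not Ludwig's actual code. It assumes plain `pickle` as a stand-in for Ray's object store serialization, and uses a small `nn.Linear` as a stand-in for the Ludwig model; the helper name `to_cpu_for_object_store` is hypothetical. The point is only that storages pickled from a CPU-resident module carry a CPU location tag, so a process with no visible GPU can restore them.

```python
import pickle

import torch
from torch import nn


def to_cpu_for_object_store(model: nn.Module) -> nn.Module:
    """Hypothetical helper: move parameters and buffers to CPU.

    nn.Module.to() moves the module in place and returns the same object.
    """
    return model.to("cpu")


# Stand-in for the real model (assumption, for illustration only).
model = nn.Linear(4, 2)
if torch.cuda.is_available():
    # Mimic an entrypoint that initializes the weights on the GPU.
    model = model.to("cuda")

# pickle stands in here for the serialization done when putting
# the model into the Ray object store.
payload = pickle.dumps(to_cpu_for_object_store(model))

# A process without GPU visibility can unpickle this payload,
# because every storage was saved from the CPU.
restored = pickle.loads(payload)
devices = {p.device.type for p in restored.parameters()}
```

After the round trip, every parameter of `restored` should report a `cpu` device, regardless of where the model was first created.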

As a workaround for now, can you try running with the following:

CUDA_VISIBLE_DEVICES="" ludwig train ...
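If the command is launched from a Python script rather than a shell, the same workaround can be sketched as below. This is an illustration under assumptions: the helper names `cpu_only_env` and `run_ludwig_cpu_only` are hypothetical, the variable is spelled `CUDA_VISIBLE_DEVICES` (the standard CUDA name), and the `--dataset` and `--config` flags are the ones used in the report above.

```python
import os
import subprocess


def cpu_only_env(base=None):
    """Return a copy of the environment with every CUDA device hidden."""
    env = dict(os.environ if base is None else base)
    # An empty value hides all GPUs from CUDA in the child process.
    env["CUDA_VISIBLE_DEVICES"] = ""
    return env


def run_ludwig_cpu_only(subcommand, dataset, config):
    """Run `ludwig <subcommand>` with GPUs hidden; raises if the command fails."""
    cmd = ["ludwig", subcommand, "--dataset", dataset, "--config", config]
    return subprocess.run(cmd, env=cpu_only_env(), check=True)
```

For the setup in this issue, the call would be `run_ludwig_cpu_only("train", "data.csv", "config.yaml")`. Building a modified copy of the environment, instead of mutating `os.environ`, keeps the GPU visible to the rest of the calling process.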

Dec 12 '23 20:12 tgaddair