CUDA deserialization fails on GPU machine
Describe the bug
I am encountering this error with the following commands:
- ludwig train (and use_gpu: false)
- ludwig experiment (any GPU settings)
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with
map_location=torch.device('cpu') to map your storages to the CPU.
To Reproduce
model_type: ecd
input_features:
  - name: quantity
    type: numerical
    encoder:
      type: dense
      dropout: 0.2
      num_layers: 1
      activation: relu
      output_size: 16
output_features:
  - name: unit
    type: category
    calibration: true
    loss:
      type: softmax_cross_entropy
    top_k: 1
trainer:
  early_stop: 3
  epochs: 100
  batch_size: 32
  learning_rate: 0.001
  optimizer:
    type: adam
backend:
  type: ray
  trainer:
    use_gpu: false
Dataset (randomly generated): data.csv
Commands:
ludwig train \
--dataset data.csv \
--config config.yaml
ludwig experiment \
--dataset data.csv \
--config config.yaml
Expected behavior
I'd expect to be able to run ludwig train and ludwig experiment on the model without CUDA-related errors.
Environment (please complete the following information):
- OS: CentOS
- Version: 7
- Python version: 3.10.13
- Ludwig version: 0.8.6
- Ray Version: 2.3.1
- Instance has an A10G
Additional context
When using ludwig train:
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/ludwig/backend/ray.py", line 552, in <lambda>
(TorchTrainer pid=18223) lambda config: tune_batch_size_fn(**config),
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/ludwig/backend/ray.py", line 266, in tune_batch_size_fn
(TorchTrainer pid=18223) model = ray.get(model_ref)
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
(TorchTrainer pid=18223) return func(*args, **kwargs)
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/worker.py", line 2382, in get
(TorchTrainer pid=18223) raise value
(TorchTrainer pid=18223) ray.exceptions.RaySystemError: System error: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
(TorchTrainer pid=18223) traceback: Traceback (most recent call last):
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 369, in deserialize_objects
(TorchTrainer pid=18223) obj = self._deserialize_object(data, metadata, object_ref)
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 252, in _deserialize_object
(TorchTrainer pid=18223) return self._deserialize_msgpack_data(data, metadata_fields)
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 207, in _deserialize_msgpack_data
(TorchTrainer pid=18223) python_objects = self._deserialize_pickle5_data(pickle5_data)
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 197, in _deserialize_pickle5_data
(TorchTrainer pid=18223) obj = pickle.loads(in_band)
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/storage.py", line 337, in _load_from_bytes
(TorchTrainer pid=18223) return torch.load(io.BytesIO(b))
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 1028, in load
(TorchTrainer pid=18223) return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 1256, in _legacy_load
(TorchTrainer pid=18223) result = unpickler.load()
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 1193, in persistent_load
(TorchTrainer pid=18223) wrap_storage=restore_location(obj, location),
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 381, in default_restore_location
(TorchTrainer pid=18223) result = fn(storage, location)
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 274, in _cuda_deserialize
(TorchTrainer pid=18223) device = validate_cuda_device(location)
(TorchTrainer pid=18223) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 258, in validate_cuda_device
(TorchTrainer pid=18223) raise RuntimeError('Attempting to deserialize object on a CUDA '
(TorchTrainer pid=18223) RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
When using ludwig experiment:
MapBatches(postprocess_batch): 0%| | 0/1 [00:10<?, ?it/s]
(_map_task pid=4697) 2023-11-22 22:28:35,512 ERROR serialization.py:371 -- Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
(_map_task pid=4697) Traceback (most recent call last):
(_map_task pid=4697) File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 369, in deserialize_objects
(_map_task pid=4697) obj = self._deserialize_object(data, metadata, object_ref)
(_map_task pid=4697) File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 252, in _deserialize_object
(_map_task pid=4697) return self._deserialize_msgpack_data(data, metadata_fields)
(_map_task pid=4697) File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 207, in _deserialize_msgpack_data
(_map_task pid=4697) python_objects = self._deserialize_pickle5_data(pickle5_data)
(_map_task pid=4697) File "/home/ml/virtualenv/lib/python3.10/site-packages/ray/_private/serialization.py", line 197, in _deserialize_pickle5_data
(_map_task pid=4697) obj = pickle.loads(in_band)
(_map_task pid=4697) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/storage.py", line 337, in _load_from_bytes
(_map_task pid=4697) return torch.load(io.BytesIO(b))
(_map_task pid=4697) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 1028, in load
(_map_task pid=4697) return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
(_map_task pid=4697) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 1256, in _legacy_load
(_map_task pid=4697) result = unpickler.load()
(_map_task pid=4697) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 1193, in persistent_load
(_map_task pid=4697) wrap_storage=restore_location(obj, location),
(_map_task pid=4697) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 381, in default_restore_location
(_map_task pid=4697) result = fn(storage, location)
(_map_task pid=4697) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 274, in _cuda_deserialize
(_map_task pid=4697) device = validate_cuda_device(location)
(_map_task pid=4697) File "/home/ml/virtualenv/lib/python3.10/site-packages/torch/serialization.py", line 258, in validate_cuda_device
(_map_task pid=4697) raise RuntimeError('Attempting to deserialize object on a CUDA '
(_map_task pid=4697) RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
Hi @philippe-solodov-wd, can you describe your use case in more detail? Are you trying to train on a GPU-enabled machine without using the GPU?
Hey @philippe-solodov-wd, it looks like the Ludwig entrypoint is initializing the model weights on the GPU, but the workers cannot deserialize them because they don't have GPU visibility. The fix on our side would be to move the model to CPU before putting it into the Ray object store when the workers don't have a GPU.
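To illustrate the idea (this is only a sketch, not the actual Ludwig patch): the traceback shows the worker failing inside pickle.loads, so plain pickle stands in for the Ray object store here. The helper name to_cpu_payload and the small Linear model are hypothetical stand-ins.

```python
import pickle

import torch


def to_cpu_payload(model: torch.nn.Module) -> bytes:
    """Move the model to CPU, then pickle it (hypothetical helper).

    Pickled tensors record the device they lived on, so a payload built
    from CPU tensors can be unpickled by a worker with no visible GPU.
    Note that nn.Module.to moves the parameters of the passed module in place.
    """
    model = model.to("cpu")
    return pickle.dumps(model)


# Stand-in for the Ludwig model; any nn.Module behaves the same way.
model = torch.nn.Linear(4, 2)
if torch.cuda.is_available():
    # Mimic the entrypoint initializing weights on the GPU.
    model = model.to("cuda")

payload = to_cpu_payload(model)

# This is the step a CPU-only worker performs when it fetches the object.
restored = pickle.loads(payload)
devices = {p.device.type for p in restored.parameters()}
```

After the round trip, every parameter of restored should live on the CPU, regardless of where the model was first initialized.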
As a workaround for now, can you try running with the following:
CUDA_VISIBLE_DEVICES="" ludwig train ...
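If the command is launched from a Python script rather than a shell, the same workaround can be sketched with the standard library. Note that the variable is spelled CUDA_VISIBLE_DEVICES. The helper name run_without_gpus is hypothetical, and the demo runs a child Python process instead of ludwig so that it works anywhere.

```python
import os
import subprocess
import sys


def run_without_gpus(cmd: list[str]) -> subprocess.CompletedProcess:
    """Run cmd with CUDA_VISIBLE_DEVICES set to an empty string (hypothetical helper).

    An empty value hides all GPUs from CUDA in the child process, so
    torch.cuda.is_available() returns False there.
    """
    env = dict(os.environ)  # copy, so the parent's environment is untouched
    env["CUDA_VISIBLE_DEVICES"] = ""
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


# Demo: a child Python process reports the value it sees.
result = run_without_gpus(
    [sys.executable, "-c", "import os; print(repr(os.environ.get('CUDA_VISIBLE_DEVICES')))"]
)
child_value = result.stdout.strip()
```

For the case in this issue, the command list would be ["ludwig", "train", "--dataset", "data.csv", "--config", "config.yaml"].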