[BUG] ValueError: Expected a cuda device, but got: cpu
Describe the bug
I trained llama2 7B with DeepSpeed and Transformers, but the run fails at the very end of training (i.e., after 100% of the training steps have completed) with:
ValueError: Expected a cuda device, but got: cpu
The details are as follows:
99%|█████████▉| 953/958 [2:47:28<00:39, 7.80s/it]
100%|█████████▉| 954/958 [2:47:36<00:30, 7.70s/it]
100%|█████████▉| 955/958 [2:47:44<00:23, 7.76s/it]
100%|█████████▉| 956/958 [2:47:51<00:15, 7.73s/it]
100%|█████████▉| 957/958 [2:47:59<00:07, 7.65s/it]
100%|██████████| 958/958 [2:48:06<00:00, 7.62s/it]
Traceback (most recent call last):
File "/home/maywind/Google Drive/finllm/training_parallel/llama2_mp_8bit_train_lora.py", line 169, in <module>
main()
File "/home/maywind/Google Drive/finllm/training_parallel/llama2_mp_8bit_train_lora.py", line 162, in main
trainer.train()
File "/home/maywind/anaconda3/envs/finllm/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
return inner_training_loop(
File "/home/maywind/anaconda3/envs/finllm/lib/python3.10/site-packages/transformers/trainer.py", line 1957, in _inner_training_loop
self._load_best_model()
File "/home/maywind/anaconda3/envs/finllm/lib/python3.10/site-packages/transformers/trainer.py", line 2144, in _load_best_model
deepspeed_load_checkpoint(self.model_wrapped, self.state.best_model_checkpoint)
File "/home/maywind/anaconda3/envs/finllm/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 402, in deepspeed_load_checkpoint
load_path, _ = deepspeed_engine.load_checkpoint(
File "/home/maywind/anaconda3/envs/finllm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2724, in load_checkpoint
load_path, client_states = self._load_checkpoint(load_dir,
File "/home/maywind/anaconda3/envs/finllm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2794, in _load_checkpoint
self.load_module_state_dict(checkpoint=checkpoint,
File "/home/maywind/anaconda3/envs/finllm/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2587, in load_module_state_dict
self.module.load_state_dict(
File "/home/maywind/anaconda3/envs/finllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2138, in load_state_dict
load(self, state_dict)
File "/home/maywind/anaconda3/envs/finllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2126, in load
load(child, child_state_dict, child_prefix)
File "/home/maywind/anaconda3/envs/finllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2126, in load
load(child, child_state_dict, child_prefix)
File "/home/maywind/anaconda3/envs/finllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2126, in load
load(child, child_state_dict, child_prefix)
[Previous line repeated 5 more times]
File "/home/maywind/anaconda3/envs/finllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2120, in load
module._load_from_state_dict(
File "/home/maywind/anaconda3/envs/finllm/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 415, in _load_from_state_dict
super()._load_from_state_dict(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys,
File "/home/maywind/anaconda3/envs/finllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1991, in _load_from_state_dict
hook(state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys, error_msgs)
File "/home/maywind/anaconda3/envs/finllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 70, in __call__
return self.hook(*args, **kwargs)
File "/home/maywind/anaconda3/envs/finllm/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 365, in maybe_rearrange_weight
tile_indices = get_tile_inds(weight_format, weight.device)
File "/home/maywind/anaconda3/envs/finllm/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 247, in get_tile_inds
return get_inverse_transform_indices(transform, _get_tile_size(format)).to(device)
File "/home/maywind/anaconda3/envs/finllm/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 79, in get_inverse_transform_indices
permuted_tile_i = transform_tile(sample_tile_i)
File "/home/maywind/anaconda3/envs/finllm/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 245, in <lambda>
transform = lambda x: F.transform(x.to(device), from_order="row", to_order=format)[0].to(x.device)
File "/home/maywind/anaconda3/envs/finllm/lib/python3.10/site-packages/bitsandbytes/functional.py", line 2196, in transform
prev_device = pre_call(A.device)
File "/home/maywind/anaconda3/envs/finllm/lib/python3.10/site-packages/bitsandbytes/functional.py", line 417, in pre_call
torch.cuda.set_device(device)
File "/home/maywind/anaconda3/envs/finllm/lib/python3.10/site-packages/torch/cuda/__init__.py", line 402, in set_device
device = _get_device_index(device)
File "/home/maywind/anaconda3/envs/finllm/lib/python3.10/site-packages/torch/cuda/_utils.py", line 35, in _get_device_index
raise ValueError(f"Expected a cuda device, but got: {device}")
ValueError: Expected a cuda device, but got: cpu
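If it helps, the last frames of the traceback reduce to bitsandbytes calling torch.cuda.set_device() with the device of a weight that is still on CPU while the best checkpoint is being reloaded. A purely illustrative sketch of just that failing call (not my training code):

import torch

# Purely illustrative: the final frames above boil down to
# bitsandbytes.functional.pre_call(A.device) calling torch.cuda.set_device()
# with the device of a tensor that still lives on CPU, which raises ValueError.
weight = torch.zeros(4, 4, dtype=torch.int8, device="cpu")
try:
    torch.cuda.set_device(weight.device)  # same call that fails inside pre_call()
except ValueError as err:
    print(err)  # prints: Expected a cuda device, but got: cpu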
To Reproduce
Environment:
- Training llama2-13B with 8-bit quantization and LoRA
- Ubuntu 23.04
- Dual RTX 4090
Code
import os

from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from peft.utils import TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING

# Put the whole model on this process's GPU when launched with a distributed
# launcher (LOCAL_RANK is set per process).
if os.environ.get('LOCAL_RANK') is not None:
    local_rank = int(os.environ.get('LOCAL_RANK', '0'))
    device_map = {'': local_rank}

# Load the base model in 8-bit (bitsandbytes) on the selected device.
# (model_name is defined earlier in the full script.)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    trust_remote_code=True,
    device_map=device_map,
)
model.gradient_checkpointing_enable()
model.enable_input_require_grads()
model.is_parallelizable = True
model.model_parallel = True
model.config.use_cache = (
    False  # silence the warnings. Please re-enable for inference!
)

# Prepare the 8-bit model for k-bit (LoRA) training.
model = prepare_model_for_kbit_training(model)
print(model)

# Attach LoRA adapters to the default llama target modules.
target_modules = TRANSFORMERS_MODELS_TO_LORA_TARGET_MODULES_MAPPING['llama']
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=target_modules,
    bias='none',
)
model = get_peft_model(model, peft_config)
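Training is driven by the HF Trainer; the traceback shows the error comes from _load_best_model() at the end of trainer.train(), so the relevant TrainingArguments are the DeepSpeed config plus load_best_model_at_end. A rough sketch with placeholder values (not my exact arguments):

from transformers import Trainer, TrainingArguments

# Placeholder sketch: load_best_model_at_end=True together with a DeepSpeed
# config is what makes trainer.train() call _load_best_model() ->
# deepspeed_load_checkpoint() after the final step, where the ValueError is raised.
training_args = TrainingArguments(
    output_dir='output',
    per_device_train_batch_size=1,
    num_train_epochs=2,
    evaluation_strategy='steps',
    eval_steps=100,
    save_strategy='steps',
    save_steps=100,
    load_best_model_at_end=True,
    deepspeed='deepspeed.json',
)
trainer = Trainer(
    model=model,                  # the PEFT-wrapped 8-bit model from above
    args=training_args,
    train_dataset=train_dataset,  # placeholder: the actual tokenized train split
    eval_dataset=eval_dataset,    # placeholder: the actual eval split
)
trainer.train()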
deepspeed.json
Expected behavior
I also trained llama2-7B with 8-bit quantization and LoRA using DeepSpeed and hit the same issue. Without DeepSpeed, training on a single RTX 4090 works fine.
I would greatly appreciate any help from the experts here. Looking forward to your reply. Thanks!