
raise ValueError(f"{tensor_name} is on the meta device, we need a `value` to put in on {device}.")

Open xiechengmude opened this issue 1 year ago • 13 comments

Loading checkpoint shards:  80%|████████  | 12/15 [00:02<00:00, 6.11it/s]
Traceback (most recent call last):
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/workspace/LLaMA-Factory/src/llmtuner/train/tuner.py", line 34, in run_exp
    run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/workspace/LLaMA-Factory/src/llmtuner/train/dpo/workflow.py", line 28, in run_dpo
    model, tokenizer = load_model_and_tokenizer(model_args, finetuning_args, training_args.do_train)
  File "/workspace/LLaMA-Factory/src/llmtuner/model/loader.py", line 75, in load_model_and_tokenizer
    model, _ = FastLlamaModel.from_pretrained(**unsloth_kwargs)
  File "/workspace/unsloth/unsloth/models/llama.py", line 672, in from_pretrained
    model = AutoModelForCausalLM.from_pretrained(
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3776, in from_pretrained
    dispatch_model(model, **device_map_kwargs)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/big_modeling.py", line 399, in dispatch_model
    attach_align_device_hook_on_blocks(
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/hooks.py", line 517, in attach_align_device_hook_on_blocks
    add_hook_to_module(module, hook)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/hooks.py", line 156, in add_hook_to_module
    module = hook.init_hook(module)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/hooks.py", line 254, in init_hook
    set_module_tensor_to_device(module, name, self.execution_device)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 306, in set_module_tensor_to_device
    raise ValueError(f"{tensor_name} is on the meta device, we need a `value` to put in on {device}.")
ValueError: weight is on the meta device, we need a `value` to put in on 6.
[2024-01-14 15:52:43,520] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2166012
[2024-01-14 15:52:44,288] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2166013

xiechengmude avatar Jan 14 '24 15:01 xiechengmude

@xiechengmude I'm going to guess your model is offloading to RAM, hence the error message. How big is your model, and can your GPU fit it?

danielhanchen avatar Jan 15 '24 02:01 danielhanchen

H100 80GB, training a 30B model with QLoRA using ZeRO-2.
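A rough back-of-the-envelope check (my own numbers: ~30B parameters at 4 bits each, plus runtime overhead that this sketch does not model) suggests the quantized base weights alone should fit comfortably in 80GB:

```python
import torch

# Rough estimate only: frozen base model stored in 4-bit quantized form.
n_params = 30e9                 # assumed parameter count
bytes_per_param = 0.5           # 4 bits per weight
base_model_gb = n_params * bytes_per_param / 1e9   # ~15 GB

# Free / total memory on the current GPU (H100 80GB here).
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"Estimated 4-bit base model: ~{base_model_gb:.0f} GB")
print(f"GPU free: {free_bytes / 1e9:.0f} GB / total: {total_bytes / 1e9:.0f} GB")
```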

xiechengmude avatar Jan 15 '24 04:01 xiechengmude

I am using a third-party training framework: LLaMA-Factory.

How can I fix the issue here?

xiechengmude avatar Jan 15 '24 05:01 xiechengmude

@xiechengmude Ohh, it's possible DeepSpeed is the issue - I have not really tested DeepSpeed with Unsloth, so it might be offloading some GPU tensors to CPU. Is it possible to use DeepSpeed without offloading for QLoRA? The LoRA weights only take a few MB, so offloading is not necessary for them.
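As a quick way to see how small the trainable adapter actually is on your side (a generic PyTorch snippet, assuming `model` is the already-prepared QLoRA model; not Unsloth-specific):

```python
# Count only the trainable (LoRA) parameters of an already-loaded PEFT/QLoRA model.
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
trainable_mb = trainable_params * 2 / 1e6   # assuming bf16/fp16 adapters, 2 bytes each
print(f"Trainable LoRA params: {trainable_params:,} (~{trainable_mb:.0f} MB)")
```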

danielhanchen avatar Jan 15 '24 07:01 danielhanchen

Here's the original zero2.json config:

{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu"
    },
    "contiguous_gradients": true,
    "overlap_comm": true
  },
  "bf16": {
    "enabled": "auto"
  },
  "fp16": {
    "enabled": "auto",
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

xiechengmude avatar Jan 15 '24 09:01 xiechengmude

Should I delete the offload_optimizer key? Something like the sketch below?
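A sketch of the same zero2.json with the offload_optimizer block removed and everything else unchanged, on the assumption that dropping it is enough to disable optimizer CPU offloading:

```json
{
  "zero_optimization": {
    "stage": 2,
    "contiguous_gradients": true,
    "overlap_comm": true
  },
  "bf16": { "enabled": "auto" },
  "fp16": {
    "enabled": "auto",
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
```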

xiechengmude avatar Jan 15 '24 09:01 xiechengmude

It still reports the error, like this:

Loading checkpoint shards:  87%|████████▋ | 13/15 [00:02<00:00, 7.16it/s]
Traceback (most recent call last):
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/workspace/LLaMA-Factory/src/llmtuner/train/tuner.py", line 34, in run_exp
    run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/workspace/LLaMA-Factory/src/llmtuner/train/dpo/workflow.py", line 28, in run_dpo
    model, tokenizer = load_model_and_tokenizer(model_args, finetuning_args, training_args.do_train)
  File "/workspace/LLaMA-Factory/src/llmtuner/model/loader.py", line 75, in load_model_and_tokenizer
    model, _ = FastLlamaModel.from_pretrained(**unsloth_kwargs)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/unsloth/models/llama.py", line 672, in from_pretrained
    model = AutoModelForCausalLM.from_pretrained(
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3776, in from_pretrained
    dispatch_model(model, **device_map_kwargs)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/big_modeling.py", line 399, in dispatch_model
    attach_align_device_hook_on_blocks(
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/hooks.py", line 517, in attach_align_device_hook_on_blocks
    add_hook_to_module(module, hook)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/hooks.py", line 156, in add_hook_to_module
    module = hook.init_hook(module)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/hooks.py", line 254, in init_hook
    set_module_tensor_to_device(module, name, self.execution_device)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 306, in set_module_tensor_to_device
    raise ValueError(f"{tensor_name} is on the meta device, we need a `value` to put in on {device}.")
ValueError: weight is on the meta device, we need a `value` to put in on 5.

xiechengmude avatar Jan 15 '24 10:01 xiechengmude

@xiechengmude Hmm, ok that's weird - have you tried ZeRO stage 1?

danielhanchen avatar Jan 16 '24 09:01 danielhanchen

Nope. Would stage 1 work here?

xiechengmude avatar Jan 18 '24 16:01 xiechengmude

Any recommended params for stage 1? Something like the sketch below?
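A minimal ZeRO stage-1 sketch for reference (my guess only, not an official recommendation: the original zero2.json with stage set to 1, no optimizer offload, and the fp16 block trimmed for brevity):

```json
{
  "zero_optimization": {
    "stage": 1,
    "contiguous_gradients": true,
    "overlap_comm": true
  },
  "bf16": { "enabled": "auto" },
  "fp16": { "enabled": "auto" },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
```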

xiechengmude avatar Jan 18 '24 16:01 xiechengmude

(screenshot attached)

My environment: H100, CUDA 12.2, PyTorch 2.1.2, LLaMA-Factory with Unsloth.

xiechengmude avatar Jan 18 '24 16:01 xiechengmude

After I created a new conda environment for it, it seems to work, but...

Another error:

[2024-01-18 16:42:42,390] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2024-01-18 16:42:42,391] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float32 ZeRO stage 2 optimizer
[2024-01-18 16:42:42,391] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 500,000,000
[2024-01-18 16:42:42,391] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 500,000,000
[2024-01-18 16:42:42,391] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: True
[2024-01-18 16:42:42,391] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
Traceback (most recent call last):
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/workspace/LLaMA-Factory/src/llmtuner/train/tuner.py", line 28, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/workspace/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 71, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/transformers/trainer.py", line 1854, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/transformers/trainer.py", line 2744, in training_step
    self.accelerator.backward(loss)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/accelerate/accelerator.py", line 1958, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1955, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2019, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 288, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 140, in decorate_bwd
    return bwd(*args, **kwargs)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/unsloth/kernels/fast_lora.py", line 149, in backward
    d_downA = h.t() @ (dY @ downB.t())
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != float
[2024-01-18 16:43:18,217] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3520017
[2024-01-18 16:43:20,241] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3520018
[2024-01-18 16:43:22,137] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3520019
[2024-01-18 16:43:23,694] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3520020
[2024-01-18 16:43:24,925] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3520021
[2024-01-18 16:43:24,926] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3520022
[2024-01-18 16:43:26,227] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3520023
[2024-01-18 16:43:27,261] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3520024

xiechengmude avatar Jan 18 '24 16:01 xiechengmude

@xiechengmude Oh, is torch.cuda.amp.autocast not working? Did you use bf16 = True in the training args?
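For reference, a minimal sketch of what I mean, using the standard Hugging Face TrainingArguments that LLaMA-Factory builds on; the paths and batch sizes here are illustrative placeholders, not a tested config:

```python
from transformers import TrainingArguments

# Illustrative only: turn on bf16 mixed precision and keep fp16 off, so the
# DeepSpeed config's "bf16": {"enabled": "auto"} resolves to bf16, matching
# the bfloat16 dtype used by Unsloth's LoRA kernels.
training_args = TrainingArguments(
    output_dir="outputs",           # hypothetical path
    per_device_train_batch_size=2,  # hypothetical value
    gradient_accumulation_steps=4,  # hypothetical value
    bf16=True,
    fp16=False,
    deepspeed="zero2.json",         # the ZeRO config discussed above
)
```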

danielhanchen avatar Jan 18 '24 16:01 danielhanchen