unsloth
Loading checkpoint shards: 100%|██████████| 15/15 [00:02<00:00, 5.83it/s]
Traceback (most recent call last):
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    ...
    raise ValueError(f"{tensor_name} is on the meta device, we need a `value` to put in on {device}.")
ValueError: weight is on the meta device, we need a `value` to put in on 6.
[2024-01-14 15:52:43,520] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2166012
[2024-01-14 15:52:44,288] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2166013
@xiechengmude I'm going to guess your model is offloading to RAM, hence the error message. How big is your model, and can your GPU fit it?
H100 80GB, training a 30B model with QLoRA using ZeRO stage 2.
Using the third-party training framework LLaMA-Factory.
How can I fix this issue?
@xiechengmude Ohh it's possible DeepSpeed is the issue - I have not really tested DeepSpeed with Unsloth, so it might be offloading some GPU tensors to CPU. Is it possible to use DeepSpeed without offloading for QLoRA? The LoRA weights only take a few MB, so offloading isn't necessary for them.
Here's the original zero2.json config:
{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": { "device": "cpu" },
    "contiguous_gradients": true,
    "overlap_comm": true
  },
  "bf16": { "enabled": "auto" },
  "fp16": {
    "enabled": "auto",
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
Should I delete the offload_optimizer key?
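For reference, the zero_optimization section with the offload_optimizer entry removed would look roughly like this. This is only a sketch of the change being discussed (the other keys in the config above stay the same), not a confirmed fix:

```json
{
  "zero_optimization": {
    "stage": 2,
    "contiguous_gradients": true,
    "overlap_comm": true
  }
}
```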
It still reports an error, like this:
Traceback (most recent call last):
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/workspace/LLaMA-Factory/src/llmtuner/train/tuner.py", line 34, in run_exp
    run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
  File "/workspace/LLaMA-Factory/src/llmtuner/train/dpo/workflow.py", line 28, in run_dpo
    model, tokenizer = load_model_and_tokenizer(model_args, finetuning_args, training_args.do_train)
  File "/workspace/LLaMA-Factory/src/llmtuner/model/loader.py", line 75, in load_model_and_tokenizer
    model, _ = FastLlamaModel.from_pretrained(**unsloth_kwargs)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/unsloth/models/llama.py", line 672, in from_pretrained
    model = AutoModelForCausalLM.from_pretrained(
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3776, in from_pretrained
    dispatch_model(model, **device_map_kwargs)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/big_modeling.py", line 399, in dispatch_model
    attach_align_device_hook_on_blocks(
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/hooks.py", line 517, in attach_align_device_hook_on_blocks
    add_hook_to_module(module, hook)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/hooks.py", line 156, in add_hook_to_module
    module = hook.init_hook(module)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/hooks.py", line 254, in init_hook
    set_module_tensor_to_device(module, name, self.execution_device)
  File "/root/miniconda3/envs/llama_factory/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 306, in set_module_tensor_to_device
    raise ValueError(f"{tensor_name} is on the meta device, we need a `value` to put in on {device}.")
ValueError: weight is on the meta device, we need a `value` to put in on 5.
@xiechengmude Hmm, ok that's weird - have you tried ZeRO stage 1?
Nope. Does stage 1 work here?
Any params for stage 1?
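In case it helps anyone following along, a minimal ZeRO stage-1 config based on the one posted above might look like this. It is only a sketch (stage set to 1, optimizer offloading dropped, everything else left on "auto"), not a configuration that has been verified to fix the meta-device error:

```json
{
  "zero_optimization": {
    "stage": 1,
    "contiguous_gradients": true,
    "overlap_comm": true
  },
  "bf16": { "enabled": "auto" },
  "optimizer": {
    "type": "AdamW",
    "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" }
  },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
```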
My environment: H100, CUDA 12.2, PyTorch 2.1.2, LLaMA-Factory with Unsloth.
After I created a new conda environment for it, it seems to work, but there's another error:
[2024-01-18 16:42:42,390] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2024-01-18 16:42:42,391] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.float32 ZeRO stage 2 optimizer
[2024-01-18 16:42:42,391] [INFO] [stage_1_and_2.py:148:__init__] Reduce bucket size 500,000,000
[2024-01-18 16:42:42,391] [INFO] [stage_1_and_2.py:149:__init__] Allgather bucket size 500,000,000
[2024-01-18 16:42:42,391] [INFO] [stage_1_and_2.py:150:__init__] CPU Offload: True
[2024-01-18 16:42:42,391] [INFO] [stage_1_and_2.py:151:__init__] Round robin gradient partitioning: False
Traceback (most recent call last):
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 14, in <module>
    main()
  File "/workspace/LLaMA-Factory/src/train_bash.py", line 5, in main
    run_exp()
  File "/workspace/LLaMA-Factory/src/llmtuner/train/tuner.py", line 28, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/workspace/LLaMA-Factory/src/llmtuner/train/sft/workflow.py", line 71, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/transformers/trainer.py", line 1854, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/transformers/trainer.py", line 2744, in training_step
    self.accelerator.backward(loss)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/accelerate/accelerator.py", line 1958, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1955, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 2019, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 288, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py", line 140, in decorate_bwd
    return bwd(*args, **kwargs)
  File "/root/miniconda3/envs/etuning/lib/python3.10/site-packages/unsloth/kernels/fast_lora.py", line 149, in backward
    d_downA = h.t() @ (dY @ downB.t())
RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::BFloat16 != float
[2024-01-18 16:43:18,217] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3520017
[2024-01-18 16:43:20,241] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3520018
[2024-01-18 16:43:22,137] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3520019
[2024-01-18 16:43:23,694] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3520020
[2024-01-18 16:43:24,925] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3520021
[2024-01-18 16:43:24,926] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3520022
[2024-01-18 16:43:26,227] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3520023
[2024-01-18 16:43:27,261] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3520024
@xiechengmude Oh is torch.cuda.amp.autocast not working? Did you use bf16 = True in the training args?
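For reference, pinning the DeepSpeed config to bf16 instead of leaving both bf16 and fp16 on "auto" would look something like this. This is a sketch of one possible direction, assuming the c10::BFloat16 != float mismatch comes from the mixed-precision settings; the same choice also has to be reflected in the training args (i.e. bf16 = True rather than fp16):

```json
{
  "bf16": { "enabled": true },
  "fp16": { "enabled": false },
  "zero_optimization": {
    "stage": 2,
    "contiguous_gradients": true,
    "overlap_comm": true
  }
}
```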