Fine-tuning on an RTX 4090D (24GB) fails with an out-of-memory error. How can this be resolved?

Open GHremedy opened this issue 1 year ago • 1 comments

root@autodl-container-ea9346a03f-6901b0ef:~/autodl-tmp/talk_robot/Med-ChatGLM# sh scripts/sft_medchat.sh
W&B offline. Running your script from this directory will only write metadata locally. Use wandb disabled to completely turn off W&B.

===================================BUG REPORT=================================== Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

/root/miniconda3/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/nvidia/lib')} warn(msg) /root/miniconda3/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain libcudart.so as expected! Searching further paths... warn(msg) /root/miniconda3/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//hf-mirror.com'), PosixPath('https')} warn(msg) /root/miniconda3/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('8443'), PosixPath('https'), PosixPath('//u376296-a03f-6901b0ef.westc.gpuhub.com')} warn(msg) /root/miniconda3/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('Asia/Shanghai')} warn(msg) /root/miniconda3/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//autodl-container-ea9346a03f-6901b0ef'), PosixPath('http'), PosixPath('8888/jupyter')} warn(msg) CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64... CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so CUDA SETUP: Highest compute capability among GPUs detected: 8.9 CUDA SETUP: Detected CUDA version 116 CUDA SETUP: Loading binary /root/miniconda3/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda116.so... 
/root/miniconda3/lib/python3.8/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True. warnings.warn( Explicitly passing a revision is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision. 05/07/2024 13:30:53 - WARNING - main - Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: False 05/07/2024 13:30:53 - INFO - main - Training/evaluation parameters TrainingArguments( _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=0.001, auto_find_batch_size=False, bf16=False, bf16_full_eval=False, data_seed=None, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800, debug=[], deepspeed=None, disable_tqdm=False, do_eval=False, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=None, evaluation_strategy=no, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False}, fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False, gradient_accumulation_steps=4, gradient_checkpointing=False, greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=<HUB_TOKEN>, ignore_data_skip=False, include_inputs_for_metrics=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0, learning_rate=5e-05, length_column_name=length, load_best_model_at_end=False, local_rank=-1, log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_dir=./log, logging_first_step=False, logging_nan_inf_filter=True, 
logging_steps=10, logging_strategy=steps, lr_scheduler_type=linear, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, no_cuda=False, num_train_epochs=3.0, optim=adamw_hf, optim_args=None, output_dir=./output/, overwrite_output_dir=False, past_index=-1, per_device_eval_batch_size=1, per_device_train_batch_size=1, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=<PUSH_TO_HUB_TOKEN>, ray_scope=last, remove_unused_columns=False, report_to=['wandb'], resume_from_checkpoint=None, run_name=chatglm_tuning, save_on_each_node=False, save_steps=500, save_strategy=epoch, save_total_limit=None, seed=2023, sharded_ddp=[], skip_memory_metrics=True, tf32=None, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None, tpu_metrics_debug=False, tpu_num_cores=None, use_ipex=False, use_legacy_prediction_loop=False, use_mps_device=False, warmup_ratio=0.1, warmup_steps=0, weight_decay=0.0, xpu_backend=None, ) [INFO|configuration_utils.py:666] 2024-05-07 13:30:54,048 >> loading configuration file ./model/config.json [INFO|configuration_utils.py:666] 2024-05-07 13:30:54,098 >> loading configuration file ./model/config.json [INFO|configuration_utils.py:720] 2024-05-07 13:30:54,099 >> Model config ChatGLMConfig { "_name_or_path": "./model/", "architectures": [ "ChatGLMForConditionalGeneration" ], "auto_map": { "AutoConfig": "configuration_chatglm.ChatGLMConfig", "AutoModel": "modeling_chatglm.ChatGLMForConditionalGeneration", "AutoModelForSeq2SeqLM": "modeling_chatglm.ChatGLMForConditionalGeneration" }, "bos_token_id": 150004, "eos_token_id": 150005, "hidden_size": 4096, "inner_hidden_size": 16384, "layernorm_epsilon": 1e-05, "max_sequence_length": 2048, "model_type": "chatglm", "num_attention_heads": 32, "num_layers": 28, "pad_token_id": 0, "position_encoding_2d": true, "torch_dtype": "float16", "transformers_version": "4.27.1", "use_cache": false, 
"vocab_size": 150528 }

[INFO|tokenization_utils_base.py:1800] 2024-05-07 13:30:54,375 >> loading file ice_text.model [INFO|tokenization_utils_base.py:1800] 2024-05-07 13:30:54,375 >> loading file added_tokens.json [INFO|tokenization_utils_base.py:1800] 2024-05-07 13:30:54,375 >> loading file special_tokens_map.json [INFO|tokenization_utils_base.py:1800] 2024-05-07 13:30:54,375 >> loading file tokenizer_config.json [WARNING|modeling_utils.py:2092] 2024-05-07 13:30:55,492 >> The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored. [INFO|modeling_utils.py:2400] 2024-05-07 13:30:55,493 >> loading weights file ./model/pytorch_model.bin.index.json [INFO|modeling_utils.py:2443] 2024-05-07 13:30:55,493 >> Will use torch_dtype=torch.float16 as defined in model's config object [INFO|modeling_utils.py:1126] 2024-05-07 13:30:55,493 >> Instantiating ChatGLMForConditionalGeneration model under default dtype torch.float16. [INFO|configuration_utils.py:575] 2024-05-07 13:30:55,494 >> Generate config GenerationConfig { "_from_model_config": true, "bos_token_id": 150004, "eos_token_id": 150005, "pad_token_id": 0, "transformers_version": "4.27.1", "use_cache": false }

Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.38s/it] [INFO|modeling_utils.py:3032] 2024-05-07 13:31:02,355 >> All model checkpoint weights were used when initializing ChatGLMForConditionalGeneration.

[INFO|modeling_utils.py:3040] 2024-05-07 13:31:02,356 >> All the weights of ChatGLMForConditionalGeneration were initialized from the model checkpoint at ./model/. If your task is similar to the task the model of the checkpoint was trained on, you can already use ChatGLMForConditionalGeneration for predictions without further training. [INFO|configuration_utils.py:535] 2024-05-07 13:31:02,423 >> loading configuration file ./model/generation_config.json [INFO|configuration_utils.py:575] 2024-05-07 13:31:02,423 >> Generate config GenerationConfig { "_from_model_config": true, "bos_token_id": 150004, "eos_token_id": 150005, "pad_token_id": 0, "transformers_version": "4.27.1" }

/root/miniconda3/lib/python3.8/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning warnings.warn( [INFO|trainer.py:1740] 2024-05-07 13:31:04,158 >> ***** Running training ***** [INFO|trainer.py:1741] 2024-05-07 13:31:04,159 >> Num examples = 2621 [INFO|trainer.py:1742] 2024-05-07 13:31:04,159 >> Num Epochs = 3 [INFO|trainer.py:1743] 2024-05-07 13:31:04,159 >> Instantaneous batch size per device = 1 [INFO|trainer.py:1744] 2024-05-07 13:31:04,159 >> Total train batch size (w. parallel, distributed & accumulation) = 4 [INFO|trainer.py:1745] 2024-05-07 13:31:04,159 >> Gradient Accumulation steps = 4 [INFO|trainer.py:1746] 2024-05-07 13:31:04,159 >> Total optimization steps = 1965 [INFO|trainer.py:1747] 2024-05-07 13:31:04,160 >> Number of trainable parameters = 6255206400 [INFO|integrations.py:709] 2024-05-07 13:31:04,161 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true" wandb: Tracking run with wandb version 0.16.6 wandb: W&B syncing is set to offline in this directory.
wandb: Run wandb online or set WANDB_MODE=online to enable cloud syncing.
0%| | 0/1965 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run_clm.py", line 564, in <module>
    main()
  File "run_clm.py", line 512, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/miniconda3/lib/python3.8/site-packages/transformers/trainer.py", line 2663, in training_step
    loss.backward()
  File "/root/miniconda3/lib/python3.8/site-packages/torch/_tensor.py", line 525, in backward
    torch.autograd.backward(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/root/miniconda3/lib/python3.8/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 128.00 MiB. GPU
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /root/autodl-tmp/talk_robot/Med-ChatGLM/wandb/offline-run-20240507_133105-d0so2akb
wandb: Find logs at: ./wandb/offline-run-20240507_133105-d0so2akb/logs

May 08 '24 08:05 GHremedy

The GPU is rented from AutoDL.

May 10 '24 06:05 GHremedy
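
A rough back-of-the-envelope estimate may help explain why the backward pass runs out of memory. The sketch below takes the "Number of trainable parameters = 6255206400" line from the log and assumes, as the log suggests (model instantiated under torch.float16, fp16=False, optim=adamw_hf), that weights, gradients, and both AdamW moment buffers are all stored as 2-byte values. The function name `estimate_gib` and the constants are hypothetical, introduced only for this illustration, and activations and CUDA overhead are ignored, so the real footprint would be higher.

```python
# Rough lower bound on GPU memory for full-parameter fine-tuning.
# Activations, temporary buffers, and CUDA context overhead are NOT counted.

TRAINABLE_PARAMS = 6_255_206_400  # "Number of trainable parameters" in the log
BYTES_PER_VALUE = 2               # assumption: every tensor is kept in float16
GIB = 1024 ** 3


def estimate_gib(n_params, bytes_per_value=BYTES_PER_VALUE):
    """Return a per-component memory estimate in GiB (hypothetical helper)."""
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value      # one gradient per trainable weight
    adam_states = 2 * n_params * bytes_per_value  # AdamW keeps two moment buffers
    sizes = {
        "weights": weights,
        "gradients": gradients,
        "adam_states": adam_states,
        "total": weights + gradients + adam_states,
    }
    return {name: size / GIB for name, size in sizes.items()}


estimate = estimate_gib(TRAINABLE_PARAMS)
for name, gib in estimate.items():
    print(f"{name:12s} {gib:6.2f} GiB")
```

Under these assumptions the weights and gradients together already come to a little over 23 GiB, which nearly fills a 24GB card before any optimizer state or activations are allocated, and the total is well beyond it. That is consistent with the failure occurring inside loss.backward(), and it suggests that training all 6.26B parameters on this card is unlikely to fit without reducing the number of trainable parameters or the memory used per parameter.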
