Med-ChatGLM
I am running the fine-tuning on an RTX 4090D (24 GB) and it fails with an out-of-memory error. How can I fix this?
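My own rough reading of why it runs out of memory: the TrainingArguments dump in the log below shows fp16=False, gradient_checkpointing=False and optim=adamw_hf, and the trainer reports about 6.26B trainable parameters, so this looks like full-parameter fine-tuning. A back-of-envelope estimate (the per-parameter byte counts are my assumptions, not anything from the repo):

```python
# Rough memory estimate for full-parameter fine-tuning of ChatGLM-6B with a
# plain (non-sharded, non-offloaded) AdamW optimizer -- my own assumption.
params = 6_255_206_400           # "Number of trainable parameters" from the trainer log

weights = params * 2             # weights loaded in float16 (torch_dtype in config.json)
grads   = params * 2             # gradients kept in the same dtype as the weights
adam    = params * 8             # adamw_hf keeps fp32 exp_avg + exp_avg_sq per parameter

total_gib = (weights + grads + adam) / 1024**3
print(f"~{total_gib:.0f} GiB before activations")   # roughly 70 GiB, far above 24 GB
```

If that estimate is even roughly right, no single batch-size or precision tweak will bring this under 24 GB. The full console output follows: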
root@autodl-container-ea9346a03f-6901b0ef:~/autodl-tmp/talk_robot/Med-ChatGLM# sh scripts/sft_medchat.sh
W&B offline. Running your script from this directory will only write metadata locally. Use `wandb disabled` to completely turn off W&B.
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to:
https://github.com/TimDettmers/bitsandbytes/issues
/root/miniconda3/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/nvidia/lib64'), PosixPath('/usr/local/nvidia/lib')}
warn(msg)
/root/miniconda3/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /usr/local/nvidia/lib:/usr/local/nvidia/lib64 did not contain libcudart.so as expected! Searching further paths...
warn(msg)
/root/miniconda3/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//hf-mirror.com'), PosixPath('https')}
warn(msg)
/root/miniconda3/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('8443'), PosixPath('https'), PosixPath('//u376296-a03f-6901b0ef.westc.gpuhub.com')}
warn(msg)
/root/miniconda3/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('Asia/Shanghai')}
warn(msg)
/root/miniconda3/lib/python3.8/site-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//autodl-container-ea9346a03f-6901b0ef'), PosixPath('http'), PosixPath('8888/jupyter')}
warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.9
CUDA SETUP: Detected CUDA version 116
CUDA SETUP: Loading binary /root/miniconda3/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda116.so...
/root/miniconda3/lib/python3.8/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
  warnings.warn(
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
05/07/2024 13:30:53 - WARNING - __main__ - Process rank: -1, device: cuda:0, n_gpu: 1distributed training: False, 16-bits training: False
05/07/2024 13:30:53 - INFO - __main__ - Training/evaluation parameters TrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=0.001,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
data_seed=None,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=False,
do_predict=False,
do_train=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'fsdp_min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=4,
gradient_checkpointing=False,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=<HUB_TOKEN>,
ignore_data_skip=False,
include_inputs_for_metrics=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=5e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=-1,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_dir=./log,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=steps,
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
no_cuda=False,
num_train_epochs=3.0,
optim=adamw_hf,
optim_args=None,
output_dir=./output/,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
ray_scope=last,
remove_unused_columns=False,
report_to=['wandb'],
resume_from_checkpoint=None,
run_name=chatglm_tuning,
save_on_each_node=False,
save_steps=500,
save_strategy=epoch,
save_total_limit=None,
seed=2023,
sharded_ddp=[],
skip_memory_metrics=True,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
warmup_ratio=0.1,
warmup_steps=0,
weight_decay=0.0,
xpu_backend=None,
)
[INFO|configuration_utils.py:666] 2024-05-07 13:30:54,048 >> loading configuration file ./model/config.json
[INFO|configuration_utils.py:666] 2024-05-07 13:30:54,098 >> loading configuration file ./model/config.json
[INFO|configuration_utils.py:720] 2024-05-07 13:30:54,099 >> Model config ChatGLMConfig {
"_name_or_path": "./model/",
"architectures": [
"ChatGLMForConditionalGeneration"
],
"auto_map": {
"AutoConfig": "configuration_chatglm.ChatGLMConfig",
"AutoModel": "modeling_chatglm.ChatGLMForConditionalGeneration",
"AutoModelForSeq2SeqLM": "modeling_chatglm.ChatGLMForConditionalGeneration"
},
"bos_token_id": 150004,
"eos_token_id": 150005,
"hidden_size": 4096,
"inner_hidden_size": 16384,
"layernorm_epsilon": 1e-05,
"max_sequence_length": 2048,
"model_type": "chatglm",
"num_attention_heads": 32,
"num_layers": 28,
"pad_token_id": 0,
"position_encoding_2d": true,
"torch_dtype": "float16",
"transformers_version": "4.27.1",
"use_cache": false,
"vocab_size": 150528
}
[INFO|tokenization_utils_base.py:1800] 2024-05-07 13:30:54,375 >> loading file ice_text.model
[INFO|tokenization_utils_base.py:1800] 2024-05-07 13:30:54,375 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:1800] 2024-05-07 13:30:54,375 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:1800] 2024-05-07 13:30:54,375 >> loading file tokenizer_config.json
[WARNING|modeling_utils.py:2092] 2024-05-07 13:30:55,492 >> The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[INFO|modeling_utils.py:2400] 2024-05-07 13:30:55,493 >> loading weights file ./model/pytorch_model.bin.index.json
[INFO|modeling_utils.py:2443] 2024-05-07 13:30:55,493 >> Will use torch_dtype=torch.float16 as defined in model's config object
[INFO|modeling_utils.py:1126] 2024-05-07 13:30:55,493 >> Instantiating ChatGLMForConditionalGeneration model under default dtype torch.float16.
[INFO|configuration_utils.py:575] 2024-05-07 13:30:55,494 >> Generate config GenerationConfig {
"_from_model_config": true,
"bos_token_id": 150004,
"eos_token_id": 150005,
"pad_token_id": 0,
"transformers_version": "4.27.1",
"use_cache": false
}
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:06<00:00, 3.38s/it]
[INFO|modeling_utils.py:3032] 2024-05-07 13:31:02,355 >> All model checkpoint weights were used when initializing ChatGLMForConditionalGeneration.
[INFO|modeling_utils.py:3040] 2024-05-07 13:31:02,356 >> All the weights of ChatGLMForConditionalGeneration were initialized from the model checkpoint at ./model/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use ChatGLMForConditionalGeneration for predictions without further training.
[INFO|configuration_utils.py:535] 2024-05-07 13:31:02,423 >> loading configuration file ./model/generation_config.json
[INFO|configuration_utils.py:575] 2024-05-07 13:31:02,423 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 150004,
  "eos_token_id": 150005,
  "pad_token_id": 0,
  "transformers_version": "4.27.1"
}
/root/miniconda3/lib/python3.8/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
[INFO|trainer.py:1740] 2024-05-07 13:31:04,158 >> ***** Running training *****
[INFO|trainer.py:1741] 2024-05-07 13:31:04,159 >> Num examples = 2621
[INFO|trainer.py:1742] 2024-05-07 13:31:04,159 >> Num Epochs = 3
[INFO|trainer.py:1743] 2024-05-07 13:31:04,159 >> Instantaneous batch size per device = 1
[INFO|trainer.py:1744] 2024-05-07 13:31:04,159 >> Total train batch size (w. parallel, distributed & accumulation) = 4
[INFO|trainer.py:1745] 2024-05-07 13:31:04,159 >> Gradient Accumulation steps = 4
[INFO|trainer.py:1746] 2024-05-07 13:31:04,159 >> Total optimization steps = 1965
[INFO|trainer.py:1747] 2024-05-07 13:31:04,160 >> Number of trainable parameters = 6255206400
[INFO|integrations.py:709] 2024-05-07 13:31:04,161 >> Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
wandb: Tracking run with wandb version 0.16.6
wandb: W&B syncing is set to `offline` in this directory.
wandb: Run `wandb online` or set WANDB_MODE=online to enable cloud syncing.
0%| | 0/1965 [00:00<?, ?it/s]
Traceback (most recent call last):
File "run_clm.py", line 564, in
The GPU is rented from AutoDL.
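For reference, these are the directions I was planning to try. I am not sure which of them scripts/sft_medchat.sh actually exposes, so the snippet below is only a sketch against the standard HuggingFace TrainingArguments fields that appear in the dump above, not the repo's own interface:

```python
# Hypothetical memory-saving overrides of the TrainingArguments printed above.
# Field names are standard transformers ones; whether run_clm.py lets them be
# changed from scripts/sft_medchat.sh is an assumption on my part.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./output/",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    fp16=True,                    # was False: run the training step in half precision
    gradient_checkpointing=True,  # was False: recompute activations instead of storing them
    optim="adamw_bnb_8bit",       # was adamw_hf: 8-bit optimizer states via bitsandbytes
)
```

Even with all three, full-parameter tuning of a ~6B model may still not fit in 24 GB, so is the recommended route here a parameter-efficient method (LoRA / P-Tuning) or DeepSpeed ZeRO offload instead?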