Problem after running weclone-cli train-sft (single-GPU training)
Environment:
OS: WSL2 + Ubuntu
GPU: RTX 5070 Ti
CUDA: 12.1
Python: 3.10 (managed with uv in a .venv)
Console output:
(.venv) overcloud@OvercloudPC:~/dev/WeClone$ weclone-cli train-sft
[WeClone] I | 20:01:00 | Loading configuration from: ./settings.jsonc
/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/torch/cuda/__init__.py:235: UserWarning: NVIDIA GeForce RTX 5070 Ti with CUDA capability sm_120 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90. If you want to use the NVIDIA GeForce RTX 5070 Ti GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(
[WeClone] I | 20:01:06 | Loading configuration from: ./settings.jsonc
[WeClone] I | 20:01:06 | Loading configuration from: ./settings.jsonc
[WeClone] I | 20:01:06 | 不启用数据清洗功能
[WeClone] I | 20:01:06 | 已更新 dataset_info.json 中的 file_name 为 sft-my.json
[WeClone] I | 20:01:06 | 微调配置:
{
"stage": "sft",
"dataset": "wechat-sft",
"dataset_dir": "./dataset/res_csv/sft",
"use_fast_tokenizer": true,
"lora_target": "q_proj,v_proj",
"lora_rank": 4,
"lora_dropout": 0.3,
"weight_decay": 0.1,
"overwrite_cache": true,
"per_device_train_batch_size": 8,
"gradient_accumulation_steps": 4,
"lr_scheduler_type": "cosine",
"cutoff_len": 256,
"logging_steps": 10,
"save_steps": 100,
"learning_rate": 0.0001,
"warmup_ratio": 0.1,
"num_train_epochs": 2,
"plot_loss": true,
"fp16": true,
"flash_attn": "fa2",
"model_name_or_path": "/home/overcloud/dev/WeClone/Qwen2.5-7B-Instruct",
"template": "qwen",
"default_system": "请你扮演一名人类,不要说自己是人工智能",
"finetuning_type": "lora",
"trust_remote_code": true,
"output_dir": "./model_output",
"do_train": true
}
[INFO|2025-06-09 20:01:06] llamafactory.hparams.parser:401 >> Process rank: 0, world size: 1, device: cuda:0, distributed training: False, compute dtype: torch.float16
Traceback (most recent call last):
File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/utils/hub.py", line 470, in cached_files
hf_hub_download(
File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
validate_repo_id(arg_value)
File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/overcloud/dev/WeClone/Qwen2.5-7B-Instruct'. Use repo_type argument if needed.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/llamafactory/model/loader.py", line 82, in load_tokenizer
tokenizer = AutoTokenizer.from_pretrained(
File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 950, in from_pretrained
tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 782, in get_tokenizer_config
resolved_config_file = cached_file(
File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/utils/hub.py", line 312, in cached_file
file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/utils/hub.py", line 522, in cached_files
resolved_files = [
File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/utils/hub.py", line 523, in repo_type argument if needed.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/utils/hub.py", line 470, in cached_files
hf_hub_download(
File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
validate_repo_id(arg_value)
File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/overcloud/dev/WeClone/Qwen2.5-7B-Instruct'. Use repo_type argument if needed.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/overcloud/dev/WeClone/.venv/bin/weclone-cli", line 10, in repo_type argument if needed.
Please help, I've been stuck on this for two days. I raised the truncation length to 512; in settings.jsonc I made the following changes (see the excerpt below):
combine_msg_max_length → 512 (or higher)
cutoff_len → 512
per_device_train_batch_size → 1
gradient_accumulation_steps → 16
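For reference, a minimal sketch of what those entries look like in settings.jsonc (key names match the 微调配置 dump above, plus combine_msg_max_length from the data-preparation settings; the exact nesting of the keys depends on the WeClone version):

```jsonc
{
  // data preparation: max combined-message length when merging consecutive messages (assumed location)
  "combine_msg_max_length": 512,
  // fine-tuning arguments, as printed in the 微调配置 dump
  "cutoff_len": 512,
  "per_device_train_batch_size": 1,
  "gradient_accumulation_steps": 16
}
```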
Other environment info:
| NVIDIA-SMI 575.55.01    Driver Version: 576.40    CUDA Version: 12.9
$ python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
CUDA available: True
Maybe the model wasn't downloaded completely?
How can I verify that? If the download is incomplete, I just delete it and re-download, right?
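For what it's worth, one quick way to check is to compare the shard files on disk against the list in model.safetensors.index.json (a sketch; the directory path is the one from the config above, and the log further down shows the weights split into 4 safetensors shards):

```bash
MODEL_DIR=/home/overcloud/dev/WeClone/Qwen2.5-7B-Instruct
# shards the model expects, according to the index file
grep -o 'model-[^"]*\.safetensors' "$MODEL_DIR/model.safetensors.index.json" | sort -u
# shards (and sizes) actually present on disk -- roughly 15 GB in total for a 7B bf16 model
ls -lh "$MODEL_DIR"/model-*.safetensors
```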
Just keep downloading with the same command you used before.
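For example, if the model was pulled with huggingface-cli, re-running the same download command should be enough, since already-complete files are skipped (a sketch; substitute the local directory you actually use):

```bash
huggingface-cli download Qwen/Qwen2.5-7B-Instruct \
  --local-dir /home/overcloud/models/Qwen2.5-7B-Instruct
# already-complete files are verified and skipped; only missing or partial shards are re-fetched
```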
I'm abroad, so I downloaded it from Hugging Face. May I also ask whether anything in settings needs to be changed to match?
No need.
Thank you! Actually I had already downloaded the model; the download path was just wrong. After I fixed the model path in settings, it runs. But now there is another problem. Console output:
(.venv) overcloud@OvercloudPC:~/dev/WeClone$ weclone-cli train-sft
[WeClone] I | 23:13:08 | Loading configuration from: ./settings.jsonc
/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/torch/cuda/__init__.py:235: UserWarning: NVIDIA GeForce RTX 5070 Ti with CUDA capability sm_120 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90. If you want to use the NVIDIA GeForce RTX 5070 Ti GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
warnings.warn(
[WeClone] I | 23:13:14 | Loading configuration from: ./settings.jsonc
[WeClone] I | 23:13:14 | Loading configuration from: ./settings.jsonc
[WeClone] I | 23:13:14 | 不启用数据清洗功能
[WeClone] I | 23:13:14 | 已更新 dataset_info.json 中的 file_name 为 sft-my.json
[WeClone] I | 23:13:14 | 微调配置:
{
"stage": "sft",
"dataset": "wechat-sft",
"dataset_dir": "./dataset/res_csv/sft",
"use_fast_tokenizer": true,
"lora_target": "q_proj,v_proj",
"lora_rank": 4,
"lora_dropout": 0.3,
"weight_decay": 0.1,
"overwrite_cache": true,
"per_device_train_batch_size": 8,
"gradient_accumulation_steps": 4,
"lr_scheduler_type": "cosine",
"cutoff_len": 256,
"logging_steps": 10,
"save_steps": 100,
"learning_rate": 0.0001,
"warmup_ratio": 0.1,
"num_train_epochs": 2,
"plot_loss": true,
"fp16": true,
"flash_attn": "fa2",
"model_name_or_path": "/home/overcloud/models/Qwen2.5-7B-Instruct",
"template": "qwen",
"default_system": "请你扮演一名人类,不要说自己是人工智能",
"finetuning_type": "lora",
"trust_remote_code": true,
"output_dir": "./model_output",
"do_train": true
}
[INFO|2025-06-09 23:13:14] llamafactory.hparams.parser:401 >> Process rank: 0, world size: 1, device: cuda:0, distributed training: False, compute dtype: torch.float16
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,289 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,289 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,289 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,289 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,289 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,289 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,289 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2299] 2025-06-09 23:13:14,455 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:696] 2025-06-09 23:13:14,455 >> loading configuration file /home/overcloud/models/Qwen2.5-7B-Instruct/config.json
[INFO|configuration_utils.py:770] 2025-06-09 23:13:14,457 >> Model config Qwen2Config {
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 3584,
"initializer_range": 0.02,
"intermediate_size": 18944,
"max_position_embeddings": 32768,
"max_window_layers": 28,
"model_type": "qwen2",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": 131072,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.52.1",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 152064
}
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,458 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,458 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,458 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,458 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,458 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,458 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,458 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2299] 2025-06-09 23:13:14,604 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|2025-06-09 23:13:14] llamafactory.data.template:143 >> Using default system message: 请你扮演一名人类,不要说自己是人工智能.
[INFO|2025-06-09 23:13:14] llamafactory.data.loader:143 >> Loading dataset sft-my.json...
Converting format of dataset: 100%|█████████████████████████████████████| 31815/31815 [00:00<00:00, 43043.83 examples/s]
Running tokenizer on dataset: 100%|█████████████████████████████████████| 31815/31815 [00:02<00:00, 11010.92 examples/s]
training example:
input_ids: [151644, 8948, 198, 112720, 102889, 101177, 103971, 3837, 100148, 111403, 20412, 104455, 151645, 198, 151644, 872, 198, 114399, 3837, 111596, 108179, 100003, 151645, 198, 151644, 77091, 198, 35946, 99744, 22243, 3837, 41321, 113867, 106065, 3837, 58, 100868, 1457, 100868, 60, 151645, 198]
inputs: <|im_start|>system
请你扮演一名人类,不要说自己是人工智能<|im_end|>
<|im_start|>user
哈哈哈,那你试试吧<|im_end|>
<|im_start|>assistant
我也不行,试了好几次,[汗][汗]<|im_end|>
label_ids: [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 35946, 99744, 22243, 3837, 41321, 113867, 106065, 3837, 58, 100868, 1457, 100868, 60, 151645, 198]
labels: 我也不行,试了好几次,[汗][汗]<|im_end|>
[INFO|configuration_utils.py:696] 2025-06-09 23:13:18,718 >> loading configuration file /home/overcloud/models/Qwen2.5-7B-Instruct/config.json
[INFO|configuration_utils.py:770] 2025-06-09 23:13:18,718 >> Model config Qwen2Config {
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 3584,
"initializer_range": 0.02,
"intermediate_size": 18944,
"max_position_embeddings": 32768,
"max_window_layers": 28,
"model_type": "qwen2",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": 131072,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.52.1",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 152064
}
[INFO|2025-06-09 23:13:18] llamafactory.model.model_utils.kv_cache:143 >> KV cache is disabled during training.
[INFO|modeling_utils.py:1146] 2025-06-09 23:13:18,903 >> loading weights file /home/overcloud/models/Qwen2.5-7B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:2239] 2025-06-09 23:13:18,903 >> Instantiating Qwen2ForCausalLM model under default dtype torch.float16.
[INFO|configuration_utils.py:1135] 2025-06-09 23:13:18,905 >> Generate config GenerationConfig {
"bos_token_id": 151643,
"eos_token_id": 151645,
"use_cache": false
}
Loading checkpoint shards: 0%| | 0/4 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/home/overcloud/dev/WeClone/.venv/bin/weclone-cli", line 10, in TORCH_USE_CUDA_DSA to enable device-side assertions.
I asked an AI about this, and it told me the cause is that my 5070 Ti is an Ada Lovelace card with compute capability sm_120, while the current PyTorch build is not compiled with sm_120 support, so the model cannot run during training. Is that really the case?
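That matches the UserWarning at the top of the log (the installed wheel only ships kernels for sm_50 through sm_90). It is easy to confirm from inside the venv, in the same python3 -c style used above:

```bash
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.get_arch_list())"
# a build that supports the RTX 5070 Ti should list 'sm_120' (compute capability 12.0) in the arch list
```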
Tested and confirmed: 50-series cards work with the latest nightly cu128 build. Official site: https://pytorch.org/get-started/locally/#start-locally — note that you should install it with uv.
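A minimal sketch of that install, assuming you stay inside the project's .venv and use the cu128 nightly index from the page above (exact package versions change daily on the nightly channel):

```bash
# replace the current torch build with a CUDA 12.8 nightly that includes sm_120 kernels
# (add torchvision/torchaudio here too if your environment needs them)
uv pip install --prerelease=allow --upgrade torch \
  --index-url https://download.pytorch.org/whl/nightly/cu128
# re-run the check above afterwards: 'sm_120' should now appear in torch.cuda.get_arch_list()
```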