
Problem after running weclone-cli train-sft for single-GPU training

Open OvercloudX opened this issue 6 months ago • 4 comments

Environment:

- OS: WSL2 + Ubuntu
- GPU: RTX 5070 Ti
- CUDA: 12.1
- Python: 3.10 (managed with .venv and uv)

Console output:

```
(.venv) overcloud@OvercloudPC:~/dev/WeClone$ weclone-cli train-sft
[WeClone] I | 20:01:00 | Loading configuration from: ./settings.jsonc
/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/torch/cuda/__init__.py:235: UserWarning: NVIDIA GeForce RTX 5070 Ti with CUDA capability sm_120 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90. If you want to use the NVIDIA GeForce RTX 5070 Ti GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
  warnings.warn(
[WeClone] I | 20:01:06 | Loading configuration from: ./settings.jsonc
[WeClone] I | 20:01:06 | Loading configuration from: ./settings.jsonc
[WeClone] I | 20:01:06 | 不启用数据清洗功能
[WeClone] I | 20:01:06 | 已更新 dataset_info.json 中的 file_name 为 sft-my.json
[WeClone] I | 20:01:06 | 微调配置: {
    "stage": "sft",
    "dataset": "wechat-sft",
    "dataset_dir": "./dataset/res_csv/sft",
    "use_fast_tokenizer": true,
    "lora_target": "q_proj,v_proj",
    "lora_rank": 4,
    "lora_dropout": 0.3,
    "weight_decay": 0.1,
    "overwrite_cache": true,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "lr_scheduler_type": "cosine",
    "cutoff_len": 256,
    "logging_steps": 10,
    "save_steps": 100,
    "learning_rate": 0.0001,
    "warmup_ratio": 0.1,
    "num_train_epochs": 2,
    "plot_loss": true,
    "fp16": true,
    "flash_attn": "fa2",
    "model_name_or_path": "/home/overcloud/dev/WeClone/Qwen2.5-7B-Instruct",
    "template": "qwen",
    "default_system": "请你扮演一名人类,不要说自己是人工智能",
    "finetuning_type": "lora",
    "trust_remote_code": true,
    "output_dir": "./model_output",
    "do_train": true
}
[INFO|2025-06-09 20:01:06] llamafactory.hparams.parser:401 >> Process rank: 0, world size: 1, device: cuda:0, distributed training: False, compute dtype: torch.float16
Traceback (most recent call last):
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/utils/hub.py", line 470, in cached_files
    hf_hub_download(
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
    validate_repo_id(arg_value)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
    raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/overcloud/dev/WeClone/Qwen2.5-7B-Instruct'. Use repo_type argument if needed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/llamafactory/model/loader.py", line 82, in load_tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 950, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 782, in get_tokenizer_config
    resolved_config_file = cached_file(
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/utils/hub.py", line 312, in cached_file
    file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/utils/hub.py", line 522, in cached_files
    resolved_files = [
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/utils/hub.py", line 523, in <listcomp>
    _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision) for filename in full_filenames
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/utils/hub.py", line 140, in _get_cache_file_to_return
    resolved_file = try_to_load_from_cache(path_or_repo_id, full_filename, cache_dir=cache_dir, revision=revision)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
    validate_repo_id(arg_value)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
    raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/overcloud/dev/WeClone/Qwen2.5-7B-Instruct'. Use repo_type argument if needed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/utils/hub.py", line 470, in cached_files
    hf_hub_download(
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
    validate_repo_id(arg_value)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
    raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/overcloud/dev/WeClone/Qwen2.5-7B-Instruct'. Use repo_type argument if needed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/overcloud/dev/WeClone/.venv/bin/weclone-cli", line 10, in <module>
    sys.exit(cli())
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1442, in __call__
    return self.main(*args, **kwargs)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1363, in main
    rv = self.invoke(ctx)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1830, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1226, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 794, in invoke
    return callback(*args, **kwargs)
  File "/home/overcloud/dev/WeClone/weclone/cli.py", line 30, in wrapper
    return func(*args, **kwargs)
  File "/home/overcloud/dev/WeClone/weclone/cli.py", line 48, in new_runtime_wrapper
    return original_cmd_func(*args, **kwargs)
  File "/home/overcloud/dev/WeClone/weclone/cli.py", line 91, in train_sft
    train_sft_main()
  File "/home/overcloud/dev/WeClone/weclone/train/train_sft.py", line 28, in main
    run_exp(train_config)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/llamafactory/train/tuner.py", line 110, in run_exp
    _training_function(config={"args": args, "callbacks": callbacks})
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/llamafactory/train/tuner.py", line 72, in _training_function
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/llamafactory/train/sft/workflow.py", line 48, in run_sft
    tokenizer_module = load_tokenizer(model_args)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/llamafactory/model/loader.py", line 90, in load_tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 950, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 782, in get_tokenizer_config
    resolved_config_file = cached_file(
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/utils/hub.py", line 312, in cached_file
    file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/utils/hub.py", line 522, in cached_files
    resolved_files = [
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/utils/hub.py", line 523, in <listcomp>
    _get_cache_file_to_return(path_or_repo_id, filename, cache_dir, revision) for filename in full_filenames
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/utils/hub.py", line 140, in _get_cache_file_to_return
    resolved_file = try_to_load_from_cache(path_or_repo_id, full_filename, cache_dir=cache_dir, revision=revision)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
    validate_repo_id(arg_value)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 154, in validate_repo_id
    raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home/overcloud/dev/WeClone/Qwen2.5-7B-Instruct'. Use repo_type argument if needed.
```
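A note on this trace: the HFValidationError is a symptom rather than the root cause. When transformers cannot find model files at the given local path, it falls back to treating the string as a Hugging Face Hub repo id, and an absolute filesystem path fails that validation. Below is a minimal sketch for checking the local model directory before training; the path and the expected file list are assumptions based on a standard Qwen2.5-7B-Instruct download, so adjust model_dir to whatever model_name_or_path says in settings.jsonc:

```python
import os

# Assumed path: must match "model_name_or_path" in settings.jsonc.
model_dir = "/home/overcloud/dev/WeClone/Qwen2.5-7B-Instruct"

# Files a complete Qwen2.5-7B-Instruct snapshot is expected to contain.
expected = [
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "model.safetensors.index.json",
]

print("directory exists:", os.path.isdir(model_dir))
for name in expected:
    status = "OK" if os.path.isfile(os.path.join(model_dir, name)) else "MISSING"
    print(f"{name}: {status}")
```

If the directory is missing or incomplete, transformers never loads it as a local model, and the repo-id validation error above is exactly what you see.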

OvercloudX · Jun 09 '25 12:06

Any advice would be appreciated; I have been stuck on this for two days. I raised the text truncation length to 512. In settings.jsonc I made the following changes (a sketch of the result follows this list):

- combine_msg_max_length → 512 (or higher)
- cutoff_len → 512
- per_device_train_batch_size → 1
- gradient_accumulation_steps → 16
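For reference, a minimal sketch of what those entries might look like in settings.jsonc after the edits. The key names come from the list above and the config dump earlier in this issue; the exact nesting inside the file is an assumption, so adapt it to the real layout:

```jsonc
{
  // data-preprocessing option (location within settings.jsonc assumed)
  "combine_msg_max_length": 512,

  // training arguments forwarded to LLaMA-Factory
  "cutoff_len": 512,
  "per_device_train_batch_size": 1,
  "gradient_accumulation_steps": 16
}
```

With these values the effective batch size per optimizer step is per_device_train_batch_size × gradient_accumulation_steps = 1 × 16 = 16.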

Other environment info:

```
| NVIDIA-SMI 575.55.01    Driver Version: 576.40    CUDA Version: 12.9 |
$ python3 -c "import torch; print(f'CUDA available: {torch.cuda.is_available()}')"
CUDA available: True
```

OvercloudX · Jun 09 '25 12:06

Maybe the model didn't finish downloading?

xming521 · Jun 09 '25 12:06

> Maybe the model didn't finish downloading?

How can I verify that? If it is incomplete, I can just delete it and re-download, right?

OvercloudX · Jun 09 '25 12:06

Just keep downloading with the same command you used before; it will resume.
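If the model came from the Hugging Face Hub, one way to resume an interrupted download is huggingface_hub's snapshot_download, which skips files that are already complete. A sketch, where local_dir is an assumption and should point at the path configured in settings.jsonc:

```python
from huggingface_hub import snapshot_download

# Re-running this resumes the download: finished files are skipped,
# partial or missing files are fetched again.
snapshot_download(
    repo_id="Qwen/Qwen2.5-7B-Instruct",
    local_dir="/home/overcloud/models/Qwen2.5-7B-Instruct",  # assumed target path
)
```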

xming521 · Jun 09 '25 12:06

> Just keep downloading with the same command you used before; it will resume.

I'm based overseas, so I downloaded it from Hugging Face. Also, do I need to change anything in settings to match?

OvercloudX · Jun 09 '25 14:06

No need.

xming521 · Jun 09 '25 14:06

Thank you! It turns out I had already downloaded the model; only the download path was wrong. After correcting the model path in settings it runs. But now there is another problem. Console output:

```
(.venv) overcloud@OvercloudPC:~/dev/WeClone$ weclone-cli train-sft
[WeClone] I | 23:13:08 | Loading configuration from: ./settings.jsonc
/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/torch/cuda/__init__.py:235: UserWarning: NVIDIA GeForce RTX 5070 Ti with CUDA capability sm_120 is not compatible with the current PyTorch installation. The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90. If you want to use the NVIDIA GeForce RTX 5070 Ti GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
  warnings.warn(
[WeClone] I | 23:13:14 | Loading configuration from: ./settings.jsonc
[WeClone] I | 23:13:14 | Loading configuration from: ./settings.jsonc
[WeClone] I | 23:13:14 | 不启用数据清洗功能
[WeClone] I | 23:13:14 | 已更新 dataset_info.json 中的 file_name 为 sft-my.json
[WeClone] I | 23:13:14 | 微调配置: {
    "stage": "sft",
    "dataset": "wechat-sft",
    "dataset_dir": "./dataset/res_csv/sft",
    "use_fast_tokenizer": true,
    "lora_target": "q_proj,v_proj",
    "lora_rank": 4,
    "lora_dropout": 0.3,
    "weight_decay": 0.1,
    "overwrite_cache": true,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "lr_scheduler_type": "cosine",
    "cutoff_len": 256,
    "logging_steps": 10,
    "save_steps": 100,
    "learning_rate": 0.0001,
    "warmup_ratio": 0.1,
    "num_train_epochs": 2,
    "plot_loss": true,
    "fp16": true,
    "flash_attn": "fa2",
    "model_name_or_path": "/home/overcloud/models/Qwen2.5-7B-Instruct",
    "template": "qwen",
    "default_system": "请你扮演一名人类,不要说自己是人工智能",
    "finetuning_type": "lora",
    "trust_remote_code": true,
    "output_dir": "./model_output",
    "do_train": true
}
[INFO|2025-06-09 23:13:14] llamafactory.hparams.parser:401 >> Process rank: 0, world size: 1, device: cuda:0, distributed training: False, compute dtype: torch.float16
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,289 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,289 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,289 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,289 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,289 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,289 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,289 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2299] 2025-06-09 23:13:14,455 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:696] 2025-06-09 23:13:14,455 >> loading configuration file /home/overcloud/models/Qwen2.5-7B-Instruct/config.json
[INFO|configuration_utils.py:770] 2025-06-09 23:13:14,457 >> Model config Qwen2Config {
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": 131072,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.52.1",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 152064
}

[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,458 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,458 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,458 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,458 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,458 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,458 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2021] 2025-06-09 23:13:14,458 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2299] 2025-06-09 23:13:14,604 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|2025-06-09 23:13:14] llamafactory.data.template:143 >> Using default system message: 请你扮演一名人类,不要说自己是人工智能.
[INFO|2025-06-09 23:13:14] llamafactory.data.loader:143 >> Loading dataset sft-my.json...
Converting format of dataset: 100%|█████████████████████████████████████| 31815/31815 [00:00<00:00, 43043.83 examples/s]
Running tokenizer on dataset: 100%|█████████████████████████████████████| 31815/31815 [00:02<00:00, 11010.92 examples/s]
training example:
input_ids: [151644, 8948, 198, 112720, 102889, 101177, 103971, 3837, 100148, 111403, 20412, 104455, 151645, 198, 151644, 872, 198, 114399, 3837, 111596, 108179, 100003, 151645, 198, 151644, 77091, 198, 35946, 99744, 22243, 3837, 41321, 113867, 106065, 3837, 58, 100868, 1457, 100868, 60, 151645, 198]
inputs: <|im_start|>system
请你扮演一名人类,不要说自己是人工智能<|im_end|>
<|im_start|>user
哈哈哈,那你试试吧<|im_end|>
<|im_start|>assistant
我也不行,试了好几次,[汗][汗]<|im_end|>

label_ids: [-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 35946, 99744, 22243, 3837, 41321, 113867, 106065, 3837, 58, 100868, 1457, 100868, 60, 151645, 198]
labels: 我也不行,试了好几次,[汗][汗]<|im_end|>

[INFO|configuration_utils.py:696] 2025-06-09 23:13:18,718 >> loading configuration file /home/overcloud/models/Qwen2.5-7B-Instruct/config.json
[INFO|configuration_utils.py:770] 2025-06-09 23:13:18,718 >> Model config Qwen2Config {
  "architectures": [
    "Qwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": null,
  "rope_theta": 1000000.0,
  "sliding_window": 131072,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.52.1",
  "use_cache": true,
  "use_sliding_window": false,
  "vocab_size": 152064
}

[INFO|2025-06-09 23:13:18] llamafactory.model.model_utils.kv_cache:143 >> KV cache is disabled during training.
[INFO|modeling_utils.py:1146] 2025-06-09 23:13:18,903 >> loading weights file /home/overcloud/models/Qwen2.5-7B-Instruct/model.safetensors.index.json
[INFO|modeling_utils.py:2239] 2025-06-09 23:13:18,903 >> Instantiating Qwen2ForCausalLM model under default dtype torch.float16.
[INFO|configuration_utils.py:1135] 2025-06-09 23:13:18,905 >> Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "use_cache": false
}

Loading checkpoint shards:   0%|          | 0/4 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/home/overcloud/dev/WeClone/.venv/bin/weclone-cli", line 10, in <module>
    sys.exit(cli())
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1442, in __call__
    return self.main(*args, **kwargs)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1363, in main
    rv = self.invoke(ctx)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1830, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 1226, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/click/core.py", line 794, in invoke
    return callback(*args, **kwargs)
  File "/home/overcloud/dev/WeClone/weclone/cli.py", line 30, in wrapper
    return func(*args, **kwargs)
  File "/home/overcloud/dev/WeClone/weclone/cli.py", line 48, in new_runtime_wrapper
    return original_cmd_func(*args, **kwargs)
  File "/home/overcloud/dev/WeClone/weclone/cli.py", line 91, in train_sft
    train_sft_main()
  File "/home/overcloud/dev/WeClone/weclone/train/train_sft.py", line 28, in main
    run_exp(train_config)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/llamafactory/train/tuner.py", line 110, in run_exp
    _training_function(config={"args": args, "callbacks": callbacks})
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/llamafactory/train/tuner.py", line 72, in _training_function
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/llamafactory/train/sft/workflow.py", line 52, in run_sft
    model = load_model(tokenizer, model_args, finetuning_args, training_args.do_train)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/llamafactory/model/loader.py", line 167, in load_model
    model = load_class.from_pretrained(**init_kwargs)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 571, in from_pretrained
    return model_class.from_pretrained(
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 308, in _wrapper
    return func(*args, **kwargs)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4613, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 5070, in _load_pretrained_model
    disk_offload_index, cpu_offload_index = _load_state_dict_into_meta_model(
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/home/overcloud/dev/WeClone/.venv/lib/python3.10/site-packages/transformers/modeling_utils.py", line 806, in _load_state_dict_into_meta_model
    param = param.to(casting_dtype)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```

I asked an AI, and it told me the cause is that my 5070 Ti is a Blackwell-architecture card (sm_120), while the current PyTorch build is not compiled with sm_120 support, so the model cannot run during training. Is that actually correct?
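One quick way to confirm this is to compare the compute capabilities the installed PyTorch wheel was compiled for with what the GPU actually reports, using standard torch APIs (a minimal sketch):

```python
import torch

# Architectures baked into this PyTorch build, e.g. ['sm_50', ..., 'sm_90'].
print("wheel built for:", torch.cuda.get_arch_list())

# Capability the driver reports for GPU 0: (12, 0) means sm_120,
# i.e. an RTX 50-series (Blackwell) card.
print("device reports:", torch.cuda.get_device_capability(0))
```

If the device capability is absent from the arch list, no kernels exist for the card, which produces exactly the "no kernel image is available for execution on the device" error above.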

OvercloudX · Jun 09 '25 15:06

Tested and working: 50-series cards run with the latest nightly cu128 build. Official site: https://pytorch.org/get-started/locally/#start-locally. Make sure to install it with the uv command.
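At the time of writing, the selector on that page produced a command along these lines for the nightly CUDA 12.8 wheels, for example `uv pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128`. The exact package list and index URL may have changed since, so copy the current command from the selector and keep the uv prefix so the wheels land in the uv-managed .venv.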

OvercloudX · Jun 09 '25 23:06