Error when computing max token length for v1_dpo_demo.jsonl
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
PRETTY_NAME="Ubuntu 24.04 LTS"
Linux 10-60-219-107 6.8.0-85-generic #85-Ubuntu SMP PREEMPT_DYNAMIC Thu Sep 18 15:26:59 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Python 3.10.0
PyTorch version: 2.8.0+cu128
GPUs: 2x NVIDIA GeForce RTX 4090 (49140 MiB each)
Reproduction
The dataset_info.json entry first tried:
"v1_dpo_demo": {
"file_name": "v1_dpo_demo.jsonl",
"ranking": true,
"formatting": "sharegpt",
"columns": {
"chosen": "chosen_messages",
"rejected": "rejected_messages"
}
}
The second dataset_info.json entry tried, with an added "messages" mapping:
"v1_dpo_demo": {
"file_name": "v1_dpo_demo.jsonl",
"ranking": true,
"stage": "dpo",
"formatting": "sharegpt",
"columns": {
"messages": "chosen_messages",
"chosen": "chosen_messages",
"rejected": "rejected_messages"
}
}
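For reference, a quick standalone check of whether each JSONL record actually contains the keys named in the columns mapping above can rule out a data/config mismatch. This is a minimal sketch: the expected keys are taken from the config above, and the record shape is an assumption based on the sharegpt ranking format.

```python
import json

# Column keys named in the dataset_info.json entry above.
EXPECTED_KEYS = {"chosen_messages", "rejected_messages"}

def missing_keys(record: dict) -> set:
    """Return the expected column keys absent from one JSONL record."""
    return EXPECTED_KEYS - record.keys()

def check_jsonl(path: str) -> list:
    """Return (line_number, missing_keys) for every malformed record."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            missing = missing_keys(json.loads(line))
            if missing:
                problems.append((lineno, missing))
    return problems
```

Running `check_jsonl("data/v1_dpo_demo.jsonl")` (path as used in this setup) should return an empty list if every record carries both keys.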
The command used to compute token lengths:
torchrun scripts/stat_utils/length_cdf.py /home/ubuntu/.cache/huggingface/hub/models--Qwen--Qwen3-8B/snapshots/b968826d9c46dd6066d109eabc6255188de91218 --dataset v1_dpo_demo --dataset_dir data --template qwen3 --stage dpo
The error output:
torchrun scripts/stat_utils/length_cdf.py /home/ubuntu/.cache/huggingface/hub/models--Qwen--Qwen3-8B/snapshots/b968826d9c46dd6066d109eabc6255188de91218 --dataset v1_dpo_demo --dataset_dir data --template qwen3 --stage dpo
[WARNING|2025-11-11 14:03:47] llamafactory.hparams.parser:148 >> We recommend enable mixed precision training.
[INFO|2025-11-11 14:03:47] llamafactory.hparams.parser:143 >> Set ddp_find_unused_parameters to False in DDP training since LoRA is enabled.
[INFO|2025-11-11 14:03:47] llamafactory.hparams.parser:455 >> Process rank: 0, world size: 1, device: cuda:0, distributed training: True, compute dtype: None
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:03:47,033 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:03:47,033 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:03:47,033 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:03:47,033 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:03:47,033 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:03:47,033 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:03:47,033 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2336] 2025-11-11 14:03:47,369 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:750] 2025-11-11 14:03:47,369 >> loading configuration file /home/ubuntu/.cache/huggingface/hub/models--Qwen--Qwen3-8B/snapshots/b968826d9c46dd6066d109eabc6255188de91218/config.json
[INFO|configuration_utils.py:817] 2025-11-11 14:03:47,371 >> Model config Qwen3Config {
"architectures": [
"Qwen3ForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 12288,
"layer_types": [
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention"
],
"max_position_embeddings": 40960,
"max_window_layers": 36,
"model_type": "qwen3",
"num_attention_heads": 32,
"num_hidden_layers": 36,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.55.0",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 151936
}
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:03:47,372 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:03:47,372 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:03:47,372 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:03:47,372 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:03:47,372 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:03:47,372 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:03:47,372 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2336] 2025-11-11 14:03:47,725 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/ubuntu/my/LLaMA-Factory/scripts/stat_utils/length_cdf.py", line 74, in
[rank0]: fire.Fire(length_cdf)
[rank0]: File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
[rank0]: component_trace = _Fire(component, args, parsed_flag_args, context, name)
[rank0]: File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
[rank0]: component, remaining_args = _CallAndUpdateTrace(
[rank0]: File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
[rank0]: component = fn(*varargs, **kwargs)
[rank0]: File "/home/ubuntu/my/LLaMA-Factory/scripts/stat_utils/length_cdf.py", line 54, in length_cdf
[rank0]: trainset = get_dataset(
[rank0]: File "/home/ubuntu/my/LLaMA-Factory/src/llamafactory/data/loader.py", line 304, in get_dataset
[rank0]: dataset = _get_merged_dataset(data_args.dataset, model_args, data_args, training_args, stage)
[rank0]: File "/home/ubuntu/my/LLaMA-Factory/src/llamafactory/data/loader.py", line 180, in _get_merged_dataset
[rank0]: raise ValueError("The dataset is not applicable in the current training stage.")
[rank0]: ValueError: The dataset is not applicable in the current training stage.
[rank0]:[W1111 14:03:48.824517910 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E1111 14:03:48.541546 334227 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 334262) of binary: /home/ubuntu/miniconda3/envs/factory/bin/python3.10
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/factory/bin/torchrun", line 7, in
sys.exit(main())
File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 357, in wrapper
return f(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 143, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
scripts/stat_utils/length_cdf.py FAILED
Failures: <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time : 2025-11-11_14:03:48
  host : 10-60-219-107
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 334262)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
I don't know how to solve this. v1_dpo_demo.jsonl is the DPO demo dataset provided on GitHub.
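While debugging, the cumulative length distribution that length_cdf.py reports can be reproduced standalone once token lengths are in hand (e.g. from `len(tokenizer.apply_chat_template(messages))`). The following is only a sketch of the bucketing logic; the interval size of 1000 and the round-up-to-bucket behavior are assumptions modeled on the script, not its exact implementation.

```python
import math
from collections import defaultdict

def length_cdf(lengths, interval=1000):
    """Map each bucket upper bound to the fraction of samples whose
    token length is at most that bound. Assumes lengths are > 0."""
    counts = defaultdict(int)
    for n in lengths:
        # round each length up to the nearest multiple of `interval`
        counts[interval * math.ceil(n / interval)] += 1
    total = len(lengths)
    cdf, running = {}, 0
    for bucket in sorted(counts):
        running += counts[bucket]
        cdf[bucket] = running / total  # cumulative fraction
    return cdf
```

For example, `length_cdf([10, 500, 1500])` reports that two thirds of samples fit within 1000 tokens and all fit within 2000, which is the kind of output used to choose a cutoff_len.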
Others
No response
The v1_dpo_demo.jsonl dataset provided on GitHub uses a complete multi-turn conversation as a single DPO training example, so in principle it should run. Computing max tokens fails, and DPO fine-tuning with v1_dpo_demo.jsonl from the LLaMA-Factory WebUI fails as well. The error output is:
W1111 14:43:53.167801 336702 site-packages/torch/distributed/run.py:774]
W1111 14:43:53.167801 336702 site-packages/torch/distributed/run.py:774] *****************************************
W1111 14:43:53.167801 336702 site-packages/torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1111 14:43:53.167801 336702 site-packages/torch/distributed/run.py:774] *****************************************
/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/jieba/_compat.py:18: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/jieba/_compat.py:18: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
import pkg_resources
[W1111 14:43:58.221138406 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W1111 14:43:58.235221952 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:43:58,688 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:43:58,688 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:43:58,688 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:43:58,688 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:43:58,688 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:43:58,689 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:43:58,689 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2336] 2025-11-11 14:43:59,066 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[INFO|configuration_utils.py:750] 2025-11-11 14:43:59,066 >> loading configuration file /home/ubuntu/.cache/huggingface/hub/models--Qwen--Qwen3-8B/snapshots/b968826d9c46dd6066d109eabc6255188de91218/config.json
[INFO|configuration_utils.py:817] 2025-11-11 14:43:59,069 >> Model config Qwen3Config {
"architectures": [
"Qwen3ForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 12288,
"layer_types": [
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention",
"full_attention"
],
"max_position_embeddings": 40960,
"max_window_layers": 36,
"model_type": "qwen3",
"num_attention_heads": 32,
"num_hidden_layers": 36,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.55.0",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 151936
}
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:43:59,070 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:43:59,070 >> loading file merges.txt
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:43:59,070 >> loading file tokenizer.json
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:43:59,070 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:43:59,070 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:43:59,070 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2065] 2025-11-11 14:43:59,070 >> loading file chat_template.jinja
[INFO|tokenization_utils_base.py:2336] 2025-11-11 14:43:59,469 >> Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:4807: UserWarning: No device id is provided via init_process_group or barrier . Using the current device set by the user.
warnings.warn( # warn only once
[rank1]:[W1111 14:43:59.257871720 ProcessGroupNCCL.cpp:5023] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
num_proc must be <= 10. Reducing num_proc to 10 for dataset of size 10.
Converting format of dataset (num_proc=10): 0%| | 0/10 [00:00<?, ? examples/s]
Converting format of dataset (num_proc=10): 0%| | 0/10 [00:00<?, ? examples/s]
/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:4807: UserWarning: No device id is provided via init_process_group or barrier . Using the current device set by the user.
warnings.warn( # warn only once
[rank0]:[W1111 14:44:05.535086009 ProcessGroupNCCL.cpp:5023] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
[rank0]: multiprocess.pool.RemoteTraceback:
[rank0]: """
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
[rank0]: result = (True, func(*args, **kwds))
[rank0]: File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 688, in _write_generator_to_queue
[rank0]: for i, result in enumerate(func(**kwargs)):
[rank0]: File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3501, in _map_single
[rank0]: for i, example in iter_outputs(shard_iterable):
[rank0]: File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3475, in iter_outputs
[rank0]: yield i, apply_function(example, i, offset=offset)
[rank0]: File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3398, in apply_function
[rank0]: processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
[rank0]: File "/home/ubuntu/my/LLaMA-Factory/src/llamafactory/data/converter.py", line 147, in call
[rank0]: messages = example[self.dataset_attr.messages]
[rank0]: File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 278, in getitem
[rank0]: value = self.data[key]
[rank0]: KeyError: None
[rank0]: """
[rank0]: The above exception was the direct cause of the following exception:
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/ubuntu/my/LLaMA-Factory/src/llamafactory/launcher.py", line 184, in
Converting format of dataset (num_proc=10): 0%| | 0/10 [00:00<?, ? examples/s]
Converting format of dataset (num_proc=10): 0%| | 0/10 [00:00<?, ? examples/s]
[rank1]: multiprocess.pool.RemoteTraceback:
[rank1]: """
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
[rank1]: result = (True, func(*args, **kwds))
[rank1]: File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 688, in _write_generator_to_queue
[rank1]: for i, result in enumerate(func(**kwargs)):
[rank1]: File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3501, in _map_single
[rank1]: for i, example in iter_outputs(shard_iterable):
[rank1]: File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3475, in iter_outputs
[rank1]: yield i, apply_function(example, i, offset=offset)
[rank1]: File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3398, in apply_function
[rank1]: processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
[rank1]: File "/home/ubuntu/my/LLaMA-Factory/src/llamafactory/data/converter.py", line 147, in call
[rank1]: messages = example[self.dataset_attr.messages]
[rank1]: File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 278, in getitem
[rank1]: value = self.data[key]
[rank1]: KeyError: None
[rank1]: """
[rank1]: The above exception was the direct cause of the following exception:
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/ubuntu/my/LLaMA-Factory/src/llamafactory/launcher.py", line 184, in
[rank1]: run_exp()
[rank1]: File "/home/ubuntu/my/LLaMA-Factory/src/llamafactory/train/tuner.py", line 122, in run_exp
[rank1]: _training_function(config={"args": args, "callbacks": callbacks})
[rank1]: File "/home/ubuntu/my/LLaMA-Factory/src/llamafactory/train/tuner.py", line 90, in _training_function
[rank1]: run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
[rank1]: File "/home/ubuntu/my/LLaMA-Factory/src/llamafactory/train/dpo/workflow.py", line 46, in run_dpo
[rank1]: dataset_module = get_dataset(template, model_args, data_args, training_args, stage="rm", **tokenizer_module)
[rank1]: File "/home/ubuntu/my/LLaMA-Factory/src/llamafactory/data/loader.py", line 304, in get_dataset
[rank1]: dataset = _get_merged_dataset(data_args.dataset, model_args, data_args, training_args, stage)
[rank1]: File "/home/ubuntu/my/LLaMA-Factory/src/llamafactory/data/loader.py", line 182, in _get_merged_dataset
[rank1]: datasets[dataset_name] = _load_single_dataset(dataset_attr, model_args, data_args, training_args)
[rank1]: File "/home/ubuntu/my/LLaMA-Factory/src/llamafactory/data/loader.py", line 162, in _load_single_dataset
[rank1]: return align_dataset(dataset, dataset_attr, data_args, training_args)
[rank1]: File "/home/ubuntu/my/LLaMA-Factory/src/llamafactory/data/converter.py", line 420, in align_dataset
[rank1]: return dataset.map(
[rank1]: File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
[rank1]: out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
[rank1]: File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3171, in map
[rank1]: for rank, done, content in iflatmap_unordered(
[rank1]: File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 728, in iflatmap_unordered
[rank1]: [async_result.get(timeout=0.05) for async_result in async_results]
[rank1]: File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 728, in
[rank1]: [async_result.get(timeout=0.05) for async_result in async_results]
[rank1]: File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/multiprocess/pool.py", line 774, in get
[rank1]: raise self._value
[rank1]: KeyError: None
W1111 14:44:07.694762 336702 site-packages/torch/distributed/elastic/multiprocessing/api.py:900] Sending process 336738 closing signal SIGTERM
E1111 14:44:07.809056 336702 site-packages/torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 336737) of binary: /home/ubuntu/miniconda3/envs/factory/bin/python3.10
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/factory/bin/torchrun", line 7, in
sys.exit(main())
File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 357, in wrapper
return f(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 143, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ubuntu/miniconda3/envs/factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/ubuntu/my/LLaMA-Factory/src/llamafactory/launcher.py FAILED
Failures: <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time : 2025-11-11_14:44:07
  host : 10-60-219-107
  rank : 0 (local_rank: 0)
  exitcode : 1 (pid: 336737)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/factory/bin/llamafactory-cli", line 7, in
Could someone please advise on how to resolve this?