
[Help] Failed to fine-tune Qwen3-4B-Instruct-2507 with LoRA using llama-factory v0.9.4.dev0

[Open] Bob123Yang opened this issue 1 month ago • 0 comments

Reminder

  • [x] I have read the above rules and searched the existing issues.

System Info

I used LLaMA-Factory v0.9.4.dev0 to fine-tune Qwen3-4B-Instruct-2507 with LoRA, but training failed with the log below. The same training run succeeded about three weeks ago.

Error log:


```
W1113 09:26:42.977000 46121 torch/distributed/run.py:774] *****************************************
W1113 09:26:42.977000 46121 torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W1113 09:26:42.977000 46121 torch/distributed/run.py:774] *****************************************
/home/test/.local/lib/python3.10/site-packages/jieba/_compat.py:18: UserWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.
  import pkg_resources
[identical jieba warning repeated by the other ranks; duplicates omitted]
Traceback (most recent call last):
  File "/home/test/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 258, in _add_dataclass_arguments
    type_hints: dict[str, type] = get_type_hints(dtype)
  File "/usr/lib/python3.10/typing.py", line 1833, in get_type_hints
    value = _eval_type(value, base_globals, base_locals)
  File "/usr/lib/python3.10/typing.py", line 329, in _eval_type
    ev_args = tuple(_eval_type(a, globalns, localns, recursive_guard) for a in t.args)
  File "/usr/lib/python3.10/typing.py", line 329, in <genexpr>
    ev_args = tuple(_eval_type(a, globalns, localns, recursive_guard) for a in t.args)
  File "/usr/lib/python3.10/typing.py", line 327, in _eval_type
    return t._evaluate(globalns, localns, recursive_guard)
  File "/usr/lib/python3.10/typing.py", line 694, in _evaluate
    eval(self.forward_code, globalns, localns),
  File "<string>", line 1, in <module>
NameError: name 'ParallelismConfig' is not defined. Did you mean: 'parallelism_config'?

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/test/LLaMA-Factory/src/llamafactory/launcher.py", line 180, in <module>
    run_exp()
  File "/home/test/LLaMA-Factory/src/llamafactory/train/tuner.py", line 110, in run_exp
    _training_function(config={"args": args, "callbacks": callbacks})
  File "/home/test/LLaMA-Factory/src/llamafactory/train/tuner.py", line 55, in _training_function
    model_args, data_args, training_args, finetuning_args, generating_args = get_train_args(args)
  File "/home/test/LLaMA-Factory/src/llamafactory/hparams/parser.py", line 219, in get_train_args
    model_args, data_args, training_args, finetuning_args, generating_args = _parse_train_args(args)
  File "/home/test/LLaMA-Factory/src/llamafactory/hparams/parser.py", line 195, in _parse_train_args
    parser = HfArgumentParser(_TRAIN_ARGS)
  File "/home/test/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 143, in __init__
    self._add_dataclass_arguments(dtype)
  File "/home/test/.local/lib/python3.10/site-packages/transformers/hf_argparser.py", line 260, in _add_dataclass_arguments
    raise RuntimeError(
RuntimeError: Type resolution failed for <class 'llamafactory.hparams.training_args.TrainingArguments'>. Try declaring the class in global scope or removing line of from __future__ import annotations which opts in Postponed Evaluation of Annotations (PEP 563)

[the same NameError -> RuntimeError chain is printed by the second failing rank; duplicate omitted]

W1113 09:26:48.088000 46121 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 46193 closing signal SIGTERM
E1113 09:26:48.152000 46121 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 46191) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/home/test/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/test/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
  File "/home/test/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/test/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/test/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 143, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/test/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/test/LLaMA-Factory/src/llamafactory/launcher.py FAILED

Failures:
[1]:
  time       : 2025-11-13_09:26:48
  host       : test-test-Product
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 46192)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time       : 2025-11-13_09:26:48
  host       : test-test-Product
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 46191)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Traceback (most recent call last):
  File "/home/test/.local/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/home/test/LLaMA-Factory/src/llamafactory/cli.py", line 24, in main
    launcher.launch()
  File "/home/test/LLaMA-Factory/src/llamafactory/launcher.py", line 110, in launch
    process = subprocess.run(
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['torchrun', '--nnodes', '1', '--node_rank', '0', '--nproc_per_node', '3', '--master_addr', '127.0.0.1', '--master_port', '42681', '/home/test/LLaMA-Factory/src/llamafactory/launcher.py', 'saves/Qwen3-4B-Instruct-2507/lora/train_2025-11-13-09-06-31/training_args.yaml']' returned non-zero exit status 1.
```
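For context, here is a minimal sketch of the failure mode (my reconstruction, not LLaMA-Factory or transformers code; the class and field names are made up). Under PEP 563 (`from __future__ import annotations`), annotations are stored as strings and only evaluated when `typing.get_type_hints()` is called, so an annotation naming a class that is missing from the importing module's namespace (here, `ParallelismConfig`) raises exactly this kind of deferred `NameError`:

```python
"""Sketch of how a postponed annotation referencing an undefined name fails late."""
from __future__ import annotations  # PEP 563: annotations become strings

import typing
from dataclasses import dataclass


@dataclass
class TrainingArgsSketch:
    # Stored as the string "ParallelismConfig | None"; nothing is evaluated
    # at class-definition time, so defining the class succeeds.
    parallelism_config: ParallelismConfig | None = None  # noqa: F821


# The error only surfaces when something (like HfArgumentParser) resolves
# the string annotations back into real types.
try:
    typing.get_type_hints(TrainingArgsSketch)
except NameError as exc:
    print(exc)  # name 'ParallelismConfig' is not defined
```

This suggests the installed transformers annotates its arguments with a type it expects to import from elsewhere (the traceback points at `transformers/hf_argparser.py`), and that import is failing in the current environment.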

Reproduction

Put your message here.
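Since the `NameError` concerns `ParallelismConfig`, a name that newer transformers releases take from accelerate, a version mismatch between transformers and accelerate is a plausible suspect given that the same run worked three weeks ago. A small, hypothetical helper (nothing in it comes from the issue itself) to record the installed versions for comparison against the last known-good environment:

```python
"""Report installed versions of the packages relevant to this failure."""
from importlib.metadata import PackageNotFoundError, version


def report_versions(names):
    """Return {package: version string, or 'not installed' if absent}."""
    out = {}
    for name in names:
        try:
            out[name] = version(name)
        except PackageNotFoundError:
            out[name] = "not installed"
    return out


if __name__ == "__main__":
    for pkg, ver in report_versions(["transformers", "accelerate", "torch"]).items():
        print(f"{pkg}: {ver}")
```

If the transformers version moved ahead of accelerate (or vice versa) since the successful run, pinning both back to the versions from that run would be a reasonable first experiment.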

Others

No response

Bob123Yang · Nov 13 '25 01:11