LLM Fine-Tuning Errors in LLM Worker Pod with PyTorch Container
Hi Team, I followed the instructions to run LLM fine-tuning, but I am hitting the errors below.
They come from:
kubectl logs pod/llama-ppwtq5t2-worker-0 -n
!pip install -U kubeflow-katib
Successfully installed kubeflow-katib-0.18.0
!pip install -U "kubeflow-training[huggingface]"
Successfully installed peft-0.15.1 tokenizers-0.21.4 transformers-4.50.2
Here are the detailed errors:
2025-10-31T03:31:21Z INFO Starting HuggingFace LLM Trainer
Traceback (most recent call last):
File "/app/hf_llm_training.py", line 188, in
**train_args = TrainingArguments(json.loads(args.training_parameters))
File "/usr/lib/python3.10/json/init.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
E1031 03:31:22.800000 140614705968960 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 52) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==2.4.0a0+f70bd71a48.nv24.6', 'console_scripts', 'torchrun')())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 900, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 891, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
hf_llm_training.py FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time       : 2025-10-31_03:31:22
  host       : llama-ppwtq5t2-worker-0
  rank       : 1 (local_rank: 0)
  exitcode   : 1 (pid: 52)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
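For what it's worth, "Expecting value: line 1 column 1 (char 0)" is exactly what json.loads raises on an empty string, which suggests the --training_parameters argument reached the trainer empty or unset. A minimal local check:

import json

# Raises json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0),
# the same error as in the worker log above.
json.loads("")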
/assign
Hi! I’m new to open source and would love to work on this issue as my first contribution. I’ve read through the discussion and understand that the problem is due to an invalid or empty training_parameters JSON. Could I please be assigned to this issue? I’ll start by reproducing the error and then propose a fix to handle and validate this parameter (a rough sketch is below).
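A minimal sketch of the kind of validation I have in mind for the trainer entrypoint; parse_training_args is a hypothetical helper name, not existing code in hf_llm_training.py:

import json

def parse_training_args(raw: str) -> dict:
    # Fail early with a clear message instead of an opaque JSONDecodeError.
    if raw is None or not raw.strip():
        raise ValueError(
            "--training_parameters is empty; it must be a JSON-encoded dict of "
            "transformers.TrainingArguments fields."
        )
    try:
        params = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"--training_parameters is not valid JSON: {e}") from e
    if not isinstance(params, dict):
        raise ValueError("--training_parameters must decode to a JSON object.")
    return params

# Intended use at the failing call site (sketch):
#   train_args = TrainingArguments(**parse_training_args(args.training_parameters))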
Absolutely, thanks for taking a look at this @RamanAsolekar!
/kind bug
@RamanAsolekar Brother, it's been a week; do you mind if I try resolving the issue?
/unassign
/assign
I would appreciate the chance to take a look and contribute a fix for this problem.