
LLM fine-tuning errors in LLM worker pod with PyTorch container

Open skb888 opened this issue 4 months ago • 5 comments

Hi Team, I followed the instructions to run the LLM fine-tuning, but hit the errors below. They come from `kubectl logs pod llama-ppwtq5t2-worker-0 -n -c pytorch`.

```
!pip install -U kubeflow-katib
Successfully installed kubeflow-katib-0.18.0

!pip install -U "kubeflow-training[huggingface]"
Successfully installed peft-0.15.1 tokenizers-0.21.4 transformers-4.50.2
```

Here is the detailed error log:

```
2025-10-31T03:31:21Z INFO     Starting HuggingFace LLM Trainer
Traceback (most recent call last):
  File "/app/hf_llm_training.py", line 188, in <module>
    train_args = TrainingArguments(**json.loads(args.training_parameters))
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
E1031 03:31:22.800000 140614705968960 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 52) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.4.0a0+f70bd71a48.nv24.6', 'console_scripts', 'torchrun')())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 900, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 891, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
hf_llm_training.py FAILED

Failures:
  <NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
  time       : 2025-10-31_03:31:22
  host       : llama-ppwtq5t2-worker-0
  rank       : 1 (local_rank: 0)
  exitcode   : 1 (pid: 52)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
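The first exception in the log is the actual root cause: `json.loads` receives an empty (or otherwise non-JSON) `--training_parameters` string. This can be reproduced outside the cluster with plain Python:

```python
import json

# An empty string fails exactly like the worker log:
# json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
try:
    json.loads("")
except json.JSONDecodeError as err:
    print(err)  # Expecting value: line 1 column 1 (char 0)
```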

skb888 · Oct 31 '25

/assign

Hi! I’m new to open source and would love to work on this issue as my first contribution. I’ve read through the discussion and understand that the problem is due to an invalid or empty training_parameters JSON. Could I please be assigned to this issue? I’ll start by reproducing the error and then propose a fix to handle and validate this parameter.
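For reference, here is a minimal sketch of what such validation could look like. The helper name `parse_training_parameters` and the error messages are illustrative, not the actual code in hf_llm_training.py; the idea is just to fail with a readable message before `TrainingArguments` is constructed:

```python
import json
import sys


def parse_training_parameters(raw: str) -> dict:
    """Validate the --training_parameters value and return it as a dict,
    exiting with a clear message instead of a bare JSONDecodeError."""
    if raw is None or not raw.strip():
        sys.exit("--training_parameters is empty; expected a JSON object, "
                 "e.g. '{\"output_dir\": \"/output\"}'")
    try:
        params = json.loads(raw)
    except json.JSONDecodeError as err:
        sys.exit(f"--training_parameters is not valid JSON: {err}")
    if not isinstance(params, dict):
        sys.exit("--training_parameters must be a JSON object")
    return params


# A valid value parses cleanly; an empty value exits with a readable message.
print(parse_training_parameters('{"output_dir": "/output"}'))
```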

RamanAsolekar · Oct 31 '25

Absolutely, thanks for taking a look at this @RamanAsolekar!

/kind bug

andreyvelich · Oct 31 '25

@RamanAsolekar It has been a week; do you mind if I try resolving the issue?

Divyanshu-Off · Nov 09 '25

/unassign

RamanAsolekar · Nov 15 '25

/assign

I would appreciate the chance to take a look and contribute a fix for this problem.

Divyanshu-Off · Nov 15 '25