training stuck

Open oraby8 opened this issue 7 months ago • 3 comments

I hit an issue during training where the process gets stuck at the gather stage, specifically at this point in the progress bar:

gather: 75%|█████████████████████████████████████████████████████████████████████████████████████▌ | 9/12 [00:24<00:05, 1.79s/it, reward=0, correct=0, completion_tokens=82]

oraby8 avatar Aug 07 '25 10:08 oraby8

I recommend running nvidia-smi to see if vLLM is still running. You can also look at .art/{project}/{model}/logs/vllm.log to get more visibility into what vLLM is doing.
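A minimal sketch of those two checks in one script, in case it is handy. The project and model names ("my-project", "my-model") are placeholders, not values from this thread, and the script assumes nvidia-smi is on PATH and that it is run from the directory containing .art:

```python
import shutil
import subprocess
from collections import deque
from pathlib import Path


def tail(path: Path, n: int = 20) -> list[str]:
    """Return the last n lines of a text file (empty list if it is missing)."""
    if not path.exists():
        return []
    with path.open(errors="replace") as f:
        # deque with maxlen keeps only the final n lines while streaming the file
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]


def gpu_processes() -> str:
    """List compute processes on the GPU, or a note if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi not found on PATH"
    result = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory", "--format=csv"],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.stdout or result.stderr


if __name__ == "__main__":
    # Placeholder names: substitute your own project and model.
    log = Path(".art") / "my-project" / "my-model" / "logs" / "vllm.log"
    print(gpu_processes())
    print("\n".join(tail(log)))
```

If the vLLM process no longer appears in the GPU process list, or the log tail stops changing while the gather progress bar is frozen, that suggests the hang is on the server side rather than in the rollout code.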

bradhilton avatar Aug 07 '25 16:08 bradhilton

I just pushed a change to address the OpenAI-compatible server hanging. Hopefully it will now crash instead of getting stuck, and you can add retry logic like the following if you like:

for _ in range(RETRIES):
  # register for every try
  await model.register(backend)
  try:
    # train loop, something like this
    for _ in range(await model.get_step(), 1_000):
      train_groups = await art.gather_trajectory_groups(
          (
              art.TrajectoryGroup(rollout(openai_client, prompt) for _ in range(32))
              for prompt in prompts
          ),
          pbar_desc="gather",
      )
      await model.train(
          train_groups
      )
    # training finished without an error, so stop retrying
    break
  except Exception:
    # ignore the error; the next try resumes from model.get_step()
    pass

Not sure if this will address the underlying issue, so would be interested to hear if it helps.

To get the latest version of openpipe-art:

uv add 'git+https://github.com/OpenPipe/ART.git#egg=openpipe-art[backend]'

bradhilton avatar Aug 20 '25 00:08 bradhilton

Same issue. I can share the code with you if that would help.

oraby8 avatar Aug 24 '25 08:08 oraby8