Training gets stuck after completing just one trajectory

Open abd-hsn opened this issue 3 months ago • 0 comments

I’m running a customized version of art-e with additional tools locally. Training proceeds normally until the first validation stage, where it gets stuck for hours.

I am running Qwen2.5-1.5B-Instruct on a single A100. Below is the code I use to load the model.

    import art
    from art.local.backend import LocalBackend

    backend = LocalBackend(path="./.art")

    model = art.TrainableModel(
        name="Qwen2.5-1.5B",
        project="my-model",
        base_model="Qwen/Qwen2.5-1.5B-Instruct",
    )

    model._internal_config = art.dev.InternalModelConfig(
        init_args=art.dev.InitArgs(
            max_seq_length=4096,
        ),
        peft_args=art.dev.PeftArgs(
            r=8,
            lora_alpha=8,
            target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        ),
        trainer_args=art.dev.TrainerArgs(
            per_device_train_batch_size=1,
            gradient_accumulation_steps=2,
        ),
        engine_args=art.dev.EngineArgs(
            gpu_memory_utilization=0.8,
            enforce_eager=True,
        ),
    )
    await model.register(backend)

Training prints the following and then hangs:

    train:   0%|          | 0/3 [00:00<?, ?it/s]
    ==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
       \\   /|    Num examples = 10,000,000 | Num Epochs = 3 | Total steps = 15,000,000
    O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 2
    \        /    Data Parallel GPUs = 1 | Total batch size (2 x 2 x 1) = 4
     "-____-"     Trainable parameters = 2,179,072 of 1,545,893,376 (0.14% trained)

I’ve been monitoring VLLM.log, and it keeps repeating the same entries. From VLLM.log:

    INFO:     127.0.0.1:46160 - "GET /metrics HTTP/1.1" 200
    INFO:     127.0.0.1:46166 - "GET /metrics HTTP/1.1" 200
    INFO:     127.0.0.1:46172 - "GET /metrics HTTP/1.1" 200
    INFO:     127.0.0.1:46178 - "GET /metrics HTTP/1.1" 200
    INFO:     127.0.0.1:46184 - "GET /metrics HTTP/1.1" 200
    INFO:     127.0.0.1:46190 - "GET /metrics HTTP/1.1" 200
    INFO:     127.0.0.1:46196 - "GET /metrics HTTP/1.1" 200
    INFO:     127.0.0.1:46204 - "GET /metrics HTTP/1.1" 200

From debug_internal.log:

    {"time":"2025-11","level":"ERROR","msg":"request failed","error":"Get \"http://localhost:8000/metrics\": dial tcp [::1]:8000: connect: connection refused","method":"GET","url":"http://localhost:8000/metrics"}
    {"time":"2025-11","level":"ERROR","msg":"request failed","error":"Get \"http://localhost:8000/metrics\": dial tcp [::1]:8000: connect: connection refused","method":"GET","url":"http://localhost:8000/metrics"}
    {"time":"2025-11","level":"ERROR","msg":"request failed","error":"Get \"http://localhost:8000/metrics\": dial tcp [::1]:8000: connect: connection refused","method":"GET","url":"http://localhost:8000/metrics"}
    {"time":"2025-11","level":"ERROR","msg":"request failed","error":"Get \"http://localhost:8000/metrics\": dial tcp [::1]:8000: connect: connection refused","method":"GET","url":"http://localhost:8000/metrics"}
    {"time":"2025-11","level":"ERROR","msg":"monitor: error sampling metrics: GET http://localhost:8000/metrics giving up after 4 attempt(s): Get \"http://localhost:8000/metrics\": dial tcp [::1]:8000: connect: connection refused"}
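The two logs disagree about the address family: VLLM.log shows `/metrics` requests arriving from 127.0.0.1, while debug_internal.log shows `localhost` resolving to `[::1]` and being refused. A minimal sketch for checking which loopback addresses accept TCP connections on a port is below. It uses only the Python standard library; port 8000 is taken from the URL in debug_internal.log, and whether that is the port the vLLM server actually listens on is an assumption. The helper names `probe` and `probe_loopback` are hypothetical.

```python
import socket


def probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and address resolution failures.
        return False


def probe_loopback(port: int) -> dict[str, bool]:
    """Probe the IPv4 loopback, the IPv6 loopback, and the name 'localhost'."""
    return {host: probe(host, port) for host in ("127.0.0.1", "::1", "localhost")}


if __name__ == "__main__":
    # Port 8000 comes from the failing URL in debug_internal.log (assumption:
    # it may or may not be the port the vLLM server is bound to).
    for host, ok in probe_loopback(8000).items():
        print(f"{host}:8000 -> {'open' if ok else 'refused or unreachable'}")
```

If `127.0.0.1` is open while `::1` is refused, the failing monitor requests would be consistent with a server bound only to IPv4, though that alone would not explain why training hangs.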

This looks similar to issue 329. I tried installing the latest version from GitHub, but the problem persists. Any pointers or examples would be greatly appreciated!

Nov 18 '25 12:11 abd-hsn