uv venv error initializing art backend on skypilot/aws

Open ecatkins opened this issue 3 months ago • 4 comments

I imagine this boils down to a change in some dependency, but I'm raising it here in case others encounter it. I have not been able to identify a fix on my end.

The script below worked for me for a few weeks and successfully set up a training environment. Overnight, however, it started failing during setup with an error from the uv environment.

Dependencies

openpipe-art[langgraph]>=0.4.11
skypilot[aws]>=0.10.3
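
Both requirements above are lower bounds, so the versions actually resolved can change from one install to the next. A minimal sketch for recording what is installed locally, so a working and a failing environment can be compared (the helper name `installed_versions` is hypothetical and not part of ART or SkyPilot; it only inspects the local Python environment, not the remote cluster):

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    # Distribution names taken from the dependency list above.
    for name, ver in installed_versions(["openpipe-art", "skypilot"]).items():
        print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

Pasting this output into the issue alongside the error log would make it easier to see whether a new release coincides with the failure.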

Script

import asyncio
import sky
from art.skypilot.backend import SkyPilotBackend
from art import TrainableModel
from art.dev import InternalModelConfig, InitArgs, EngineArgs


async def setup_cluster():
    print("🚀 Setting up SkyPilot Cluster for Data SQL Agent")
    print("=" * 50)

    resources = sky.Resources(
        cloud=sky.AWS(),
        accelerators="H100:1",
    )

    print("🔧 Initializing cluster...")
    backend = await SkyPilotBackend.initialize_cluster(
        cluster_name="data-sql-agent-cluster",
        resources=resources,
    )

    print("✅ Cluster initialized successfully!")
    print(f"📡 Backend: {backend}")

    print("🤖 Creating TrainableModel...")
    model = TrainableModel(
        name="data-sql-agent-v1",
        project="data-sql-agent",
        base_model="Qwen/Qwen2.5-7B-Instruct",
        _internal_config=InternalModelConfig(
            init_args=InitArgs(
                max_seq_length=8192,
                enable_prefix_caching=False,
                load_in_4bit=True,
                fast_inference=True,
            ),
            engine_args=EngineArgs(
                max_model_len=8192,
                enforce_eager=True,
                disable_cuda_graph=True,
                enable_sleep_mode=False,
                gpu_memory_utilization=0.75,
                swap_space=4,
                num_scheduler_steps=1,
                max_num_seqs=32,
                max_num_batched_tokens=1024,
                enable_chunked_prefill=False,
                multi_step_stream_outputs=False,
            ),
        ),
    )

    print("๐Ÿ“ Registering model with backend...")
    await model.register(backend)

    print("🎉 Setup Complete!")
    return backend, model


if __name__ == "__main__":
    print("🧪 ART SkyPilot Cluster Setup (Data SQL Agent)")
    asyncio.run(setup_cluster())

Error

⚙︎ Job submitted, ID: 1
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(setup pid=2689) downloading uv 0.8.18 x86_64-unknown-linux-gnu
(setup pid=2689) no checksums to verify
(setup pid=2689) installing to /home/ubuntu/.local/bin
(setup pid=2689)   uv
(setup pid=2689)   uvx
(setup pid=2689) everything's installed!
(setup pid=2689) error: No virtual environment found; run `uv venv` to create an environment, or pass `--system` to install into a non-virtual environment
ERROR: Job 1's setup failed with return code list: [2]

ecatkins, Sep 18 '25 14:09