uv venv error initializing art backend on skypilot/aws

Open ecatkins opened this issue 7 months ago • 4 comments

I imagine this boils down to a change in a dependency, but I am raising it here in case others run into it. I have not been able to identify a fix on my end.

The script below worked for me for a few weeks and successfully set up a training environment. Overnight, however, it started failing with a uv environment error during setup.

Dependencies

openpipe-art[langgraph]>=0.4.11
skypilot[aws]>=0.10.3

Script

import asyncio
import sky
from art.skypilot.backend import SkyPilotBackend
from art import TrainableModel
from art.dev import InternalModelConfig, InitArgs, EngineArgs


async def setup_cluster():
    print("🚀 Setting up SkyPilot Cluster for Data SQL Agent")
    print("=" * 50)

    resources = sky.Resources(
        cloud=sky.AWS(),
        accelerators="H100:1",
    )

    print("🔧 Initializing cluster...")
    backend = await SkyPilotBackend.initialize_cluster(
        cluster_name="data-sql-agent-cluster",
        resources=resources,
    )

    print("✅ Cluster initialized successfully!")
    print(f"📡 Backend: {backend}")

    print("🤖 Creating TrainableModel...")
    model = TrainableModel(
        name="data-sql-agent-v1",
        project="data-sql-agent",
        base_model="Qwen/Qwen2.5-7B-Instruct",
        _internal_config=InternalModelConfig(
            init_args=InitArgs(
                max_seq_length=8192,
                enable_prefix_caching=False,
                load_in_4bit=True,
                fast_inference=True,
            ),
            engine_args=EngineArgs(
                max_model_len=8192,
                enforce_eager=True,
                disable_cuda_graph=True,
                enable_sleep_mode=False,
                gpu_memory_utilization=0.75,
                swap_space=4,
                num_scheduler_steps=1,
                max_num_seqs=32,
                max_num_batched_tokens=1024,
                enable_chunked_prefill=False,
                multi_step_stream_outputs=False,
            ),
        ),
    )

    print("📝 Registering model with backend...")
    await model.register(backend)

    print("🎉 Setup Complete!")
    return backend, model


if __name__ == "__main__":
    print("🧪 ART SkyPilot Cluster Setup (Data SQL Agent)")
    asyncio.run(setup_cluster())

Error

⚙︎ Job submitted, ID: 1
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(setup pid=2689) downloading uv 0.8.18 x86_64-unknown-linux-gnu
(setup pid=2689) no checksums to verify
(setup pid=2689) installing to /home/ubuntu/.local/bin
(setup pid=2689)   uv
(setup pid=2689)   uvx
(setup pid=2689) everything's installed!
(setup pid=2689) error: No virtual environment found; run `uv venv` to create an environment, or pass `--system` to install into a non-virtual environment
ERROR: Job 1's setup failed with return code list: [2]

ecatkins, Sep 18 '25 14:09

So the problem is that you are not passing a project root here. The setup script in backend.py for SkyPilot is something like this:

        if art_version_is_semver:
            art_installation_command = (
                f"uv pip install openpipe-art[backend]=={art_version}"
            )
        elif os.path.exists(art_version):
            # copy the contents of the art_path onto the new machine
            task.workdir = art_version
            art_installation_command = "uv sync --extra backend"
        else:
            raise ValueError(
                f"Invalid art_version: {art_version}. Must be a semver or a path to a local directory."
            )

        setup_script = f"""
    curl -LsSf https://astral.sh/uv/install.sh | sh

    source $HOME/.local/bin/env

    git config --global --add safe.directory /root/sky_workdir

    {art_installation_command}
    """

So when you initialize a cluster, this script is run, but sky_workdir also needs to have your project root synced for uv sync --extra backend to actually work.

As a quick fix, could you try this:

project_root = "path to your project root"
backend = await SkyPilotBackend.initialize_cluster(
    cluster_name="data-sql-agent-cluster",
    resources=resources,
    art_version=project_root,
)
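
The branch quoted above can be read as a small decision function. Below is a minimal sketch of that logic, not the library's actual code: `choose_install_command` and the simplified `SEMVER_RE` pattern are hypothetical names introduced for illustration, and the real check behind `art_version_is_semver` may differ.

```python
import os
import re
from typing import Optional, Tuple

# Assumption: a simplified "X.Y.Z" pattern; the real semver check in ART may differ.
SEMVER_RE = re.compile(r"^\d+\.\d+\.\d+$")


def choose_install_command(art_version: str) -> Tuple[str, Optional[str]]:
    """Return (install_command, workdir), following the branch quoted above."""
    if SEMVER_RE.match(art_version):
        # Released version: installed with `uv pip install`, no workdir is synced.
        return f"uv pip install openpipe-art[backend]=={art_version}", None
    if os.path.exists(art_version):
        # Local checkout: the directory becomes the workdir and `uv sync` runs in it.
        return "uv sync --extra backend", art_version
    raise ValueError(
        f"Invalid art_version: {art_version}. "
        "Must be a semver or a path to a local directory."
    )
```

Under this reading, a version string such as "0.4.11" takes the `uv pip install` path and syncs no workdir, while a local path takes the `uv sync` path.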

abhinav262666, Sep 18 '25 15:09

@abhinav262666 - apologies if I am misunderstanding your comment or the way the library works.

At least the way I am reading the code, I should be able to leave art_version=None or, to be more explicit, set it to the latest version (art_version="0.4.11"), and it will install the environment with art for me.

And up until 24 hours ago, I was able to run my code and set up the backend for my agent RL runs in this manner.

I did try your method and ran into some other, seemingly unrelated errors.

ecatkins, Sep 18 '25 16:09

I have confirmed that the break in my code appears to be related to the latest uv release (from yesterday).

Changing the setup script to pin the uv installer to the previous version resolves my issue:

setup_script = f"""
    curl -LsSf https://astral.sh/uv/0.8.17/install.sh | sh

    source $HOME/.local/bin/env

    git config --global --add safe.directory /root/sky_workdir

    {art_installation_command}
    """
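
To make the pin easy to change or remove later, the installer URL can be built from a version string. This is a minimal sketch: `build_setup_script` and its `uv_version` parameter are hypothetical and not part of ART, and only the two installer URL forms shown in this thread are assumed.

```python
from typing import Optional


def build_setup_script(
    art_installation_command: str, uv_version: Optional[str] = None
) -> str:
    """Build the setup script, optionally pinning the uv installer to one release."""
    if uv_version is None:
        # Unpinned form: installs whatever the latest uv release is.
        installer_url = "https://astral.sh/uv/install.sh"
    else:
        # Pinned form, as used in the workaround above.
        installer_url = f"https://astral.sh/uv/{uv_version}/install.sh"

    return f"""
    curl -LsSf {installer_url} | sh

    source $HOME/.local/bin/env

    git config --global --add safe.directory /root/sky_workdir

    {art_installation_command}
    """
```

Calling `build_setup_script(cmd, uv_version="0.8.17")` reproduces the pinned script above; leaving `uv_version` unset gives the original unpinned one.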

ecatkins, Sep 18 '25 16:09

This issue appears for every SkyPilot provider. I was able to fix it by passing --system to the uv sync command iirc.
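
The uv error message in the log names two remedies: create an environment with `uv venv`, or pass `--system`. Below is a rough sketch of how either could be applied to the `uv pip install` form of the install command. `with_environment` is a hypothetical helper, not part of ART, and this thread does not confirm that the rest of the backend picks up packages installed either way.

```python
def with_environment(install_command: str, use_system: bool = False) -> str:
    """Apply one of the two remedies from the uv error to a `uv pip install` command."""
    prefix = "uv pip install "
    if not install_command.startswith(prefix):
        # Assumption: other commands (e.g. `uv sync`) are passed through unchanged.
        return install_command
    if use_system:
        # Remedy 2: install into the system (non-virtual) Python environment.
        return f"{prefix}--system {install_command[len(prefix):]}"
    # Remedy 1: create a virtual environment first, then install into it.
    return f"uv venv && {install_command}"
```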

Timo972, Sep 18 '25 23:09