uv venv error initializing art backend on skypilot/aws
I imagine this boils down to some change in a dependency, but I'm raising it here in case others encounter it. I have not been able to identify a fix on my end.
The script below worked for me over a period of a few weeks to set up a training environment. Overnight, however, it started failing with a uv environment error during setup.
Dependencies
openpipe-art[langgraph]>=0.4.11
skypilot[aws]>=0.10.3
Script
import asyncio
import sky
from art.skypilot.backend import SkyPilotBackend
from art import TrainableModel
from art.dev import InternalModelConfig, InitArgs, EngineArgs


async def setup_cluster():
    print("Setting up SkyPilot Cluster for Data SQL Agent")
    print("=" * 50)

    resources = sky.Resources(
        cloud=sky.AWS(),
        accelerators="H100:1",
    )

    print("Initializing cluster...")
    backend = await SkyPilotBackend.initialize_cluster(
        cluster_name="data-sql-agent-cluster",
        resources=resources,
    )
    print("Cluster initialized successfully!")
    print(f"Backend: {backend}")

    print("Creating TrainableModel...")
    model = TrainableModel(
        name="data-sql-agent-v1",
        project="data-sql-agent",
        base_model="Qwen/Qwen2.5-7B-Instruct",
        _internal_config=InternalModelConfig(
            init_args=InitArgs(
                max_seq_length=8192,
                enable_prefix_caching=False,
                load_in_4bit=True,
                fast_inference=True,
            ),
            engine_args=EngineArgs(
                max_model_len=8192,
                enforce_eager=True,
                disable_cuda_graph=True,
                enable_sleep_mode=False,
                gpu_memory_utilization=0.75,
                swap_space=4,
                num_scheduler_steps=1,
                max_num_seqs=32,
                max_num_batched_tokens=1024,
                enable_chunked_prefill=False,
                multi_step_stream_outputs=False,
            ),
        ),
    )

    print("Registering model with backend...")
    await model.register(backend)
    print("Setup Complete!")
    return backend, model


if __name__ == "__main__":
    print("ART SkyPilot Cluster Setup (Data SQL Agent)")
    asyncio.run(setup_cluster())
Error
⚙︎ Job submitted, ID: 1
├── Waiting for task resources on 1 node.
└── Job started. Streaming logs... (Ctrl-C to exit log streaming; job will not be killed)
(setup pid=2689) downloading uv 0.8.18 x86_64-unknown-linux-gnu
(setup pid=2689) no checksums to verify
(setup pid=2689) installing to /home/ubuntu/.local/bin
(setup pid=2689) uv
(setup pid=2689) uvx
(setup pid=2689) everything's installed!
(setup pid=2689) error: No virtual environment found; run `uv venv` to create an environment, or pass `--system` to install into a non-virtual environment
ERROR: Job 1's setup failed with return code list: [2]
The problem is that you are not passing a project root here. The setup script in backend.py for SkyPilot looks something like this:
if art_version_is_semver:
    art_installation_command = (
        f"uv pip install openpipe-art[backend]=={art_version}"
    )
elif os.path.exists(art_version):
    # copy the contents of the art_path onto the new machine
    task.workdir = art_version
    art_installation_command = "uv sync --extra backend"
else:
    raise ValueError(
        f"Invalid art_version: {art_version}. Must be a semver or a path to a local directory."
    )

setup_script = f"""
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
git config --global --add safe.directory /root/sky_workdir
{art_installation_command}
"""
This script runs when you initialize a cluster, but for uv sync --extra backend to actually work, sky_workdir must also have your project root synced to it.
As a quick fix, could you try this:
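To make the branching in that excerpt easier to follow, here is a minimal standalone sketch of the same decision. The function name `pick_install_command` and the regex are hypothetical stand-ins; the library's actual `art_version_is_semver` check is not shown in this thread and may differ.

```python
import os
import re
from typing import Optional, Tuple

# Hypothetical stand-in for the library's semver check (an assumption).
_SEMVER = re.compile(r"^\d+\.\d+\.\d+$")


def pick_install_command(art_version: str) -> Tuple[str, Optional[str]]:
    """Return (install_command, workdir), mirroring the excerpt's branches."""
    if _SEMVER.match(art_version):
        # Published release: installed with `uv pip install`, nothing is synced.
        return f"uv pip install openpipe-art[backend]=={art_version}", None
    if os.path.exists(art_version):
        # Local directory: it becomes the workdir and `uv sync` runs inside it.
        return "uv sync --extra backend", art_version
    raise ValueError(
        f"Invalid art_version: {art_version}. "
        "Must be a semver or a path to a local directory."
    )
```

Note that only the semver branch uses `uv pip install`, which is the kind of command the "No virtual environment found" message refers to.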
project_root = "path to your project root"
backend = await SkyPilotBackend.initialize_cluster(
    cluster_name="data-sql-agent-cluster",
    resources=resources,
    art_version=project_root,
)
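If you go the local-path route, it may help to check the directory before launching a cluster, since `uv sync` operates on a project that has a `pyproject.toml`. The helper below is a hypothetical sketch, not part of ART or SkyPilot, and it assumes the directory you pass is a checkout that defines the `backend` extra.

```python
from pathlib import Path


def checked_project_root(path: str) -> str:
    """Return an absolute project root, or raise if it has no pyproject.toml."""
    root = Path(path).expanduser().resolve()
    if not (root / "pyproject.toml").is_file():
        # Without a pyproject.toml, `uv sync` has no project to sync.
        raise FileNotFoundError(f"No pyproject.toml in {root}")
    return str(root)
```

The returned string could then be passed as `art_version=checked_project_root("path to your project root")`.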
@abhinav262666 - apologies if I am misunderstanding your comment or the way the library works.
At least the way I am reading the code, I should be able to leave art_version=None or, to be more explicit, set it to the latest version with art_version="0.4.11", and it will install the environment with art for me.
And up until 24 hours ago, I was able to run my code and set up the backend for my agent RL runs in this manner.
I did try your method, but ran into some other, seemingly unrelated errors.
I have confirmed that the break in my code appears to be related to the latest uv release (from yesterday).
Changing the setup script to pin the previous version resolves my issue:
setup_script = f"""
curl -LsSf https://astral.sh/uv/0.8.17/install.sh | sh
source $HOME/.local/bin/env
git config --global --add safe.directory /root/sky_workdir
{art_installation_command}
"""
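For anyone patching this locally, the pin can be made a parameter rather than a hard-coded URL. This is a sketch only: `build_setup_script` and its `uv_version` argument are hypothetical names, and the two installer URL forms are the ones that appear in this thread.

```python
from typing import Optional


def build_setup_script(install_command: str, uv_version: Optional[str] = "0.8.17") -> str:
    """Build the setup script, optionally pinning the uv installer version."""
    if uv_version is None:
        # Unpinned: fetches whatever uv release is current.
        installer_url = "https://astral.sh/uv/install.sh"
    else:
        # Pinned: fetches the installer for one specific uv release.
        installer_url = f"https://astral.sh/uv/{uv_version}/install.sh"
    return (
        f"curl -LsSf {installer_url} | sh\n"
        "source $HOME/.local/bin/env\n"
        "git config --global --add safe.directory /root/sky_workdir\n"
        f"{install_command}\n"
    )
```

Pinning trades automatic uv upgrades for reproducible cluster setup, so the pin needs to be revisited once the underlying incompatibility is resolved.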
This issue appears for every SkyPilot provider. I was able to fix it by passing --system to the uv sync command iirc.
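The error message itself names two ways out: run `uv venv` first, or pass `--system`. The sketch below applies either one to the install step. `make_install_step` is a hypothetical helper, and it only touches `uv pip install` commands, since that is the command the error message offers `--system` for; whether the rest of the backend then picks up the resulting environment is not confirmed in this thread.

```python
def make_install_step(install_command: str, use_system: bool = False) -> str:
    """Adapt a `uv pip install` command so it does not need an existing venv."""
    prefix = "uv pip install"
    if not install_command.startswith(prefix):
        # e.g. `uv sync`, which manages the project environment itself.
        return install_command
    if use_system:
        # Install into the system Python instead of a virtual environment.
        return install_command.replace(prefix, f"{prefix} --system", 1)
    # Create ./.venv first so the install has an environment to target.
    return f"uv venv\n{install_command}"
```

For example, `make_install_step("uv pip install openpipe-art[backend]==0.4.11")` yields a two-line step that creates the environment and then installs into it.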