[Core] Install SkyPilot runtime in separate env
Fixes #2722
Background
Installing every skypilot runtime in the base environment can cause issues when users also tries to install their dependencies in the base environment on the remote VM. It is very easy to cause the VM to become in a unexpected state. We now move all the skypilot runtime to a new conda environment for better isolation from the users' own python environment.
Profiling
(average across three times)
sky launch -c test-aws --cloud aws --cpus 2
- Original: 3m2s
- conda implementation: 3m16s
- venv implementation (current implementation): 2m42s
(average across 5 times)
for i in {1..5}; do time sky launch -c test-aws-$i --cloud aws --cpus 2 -y; done
- master (conda base, 7c514ba): 2m23.9s
2m34.249s 2m30.460s 2m29.660s 2m14.271s 2m11.077s - venv implementation (current, 841f1ff): 2m39.8s (16s longer)
3m0.389s 2m56.428s 2m37.248s 2m16.727s 2m28.398s
sky launch -c test-gcp --cloud gcp --cpus 2
- Original: 2m50s
- conda implementation: 3m38s
- venv implementation (current implementation): 3m3s
(average across 5 times)
for i in {1..5}; do time sky launch -c test-gcp-$i --cloud gcp --cpus 2 -y; done
- master (conda base, 7c514ba): 2m43.3s
2m41.862s 2m56.141s 2m39.850s 2m41.099s 2m37.994s - venv implementation (current, 841f1ff): 3m10.4s (27s longer)
3m20.109s 3m9.852s 3m2.220s 3m17.485s 3m2.733s
TODOs
- [x] Make other clouds work with the new separate venv
- [x] Backward compatibility test
- [ ] Reduce the time for provisioning
Tested (run the relevant ones):
- [ ] Code formatting:
bash format.sh - [ ] Any manual or new tests for this PR (please specify below)
- [x]
sky launch -c test-launch --cloud aws --num-nodes 2 --cpus 2+ echo hi - [x]
sky launch -c test-launch --cloud gcp --num-nodes 2 --cpus 2+ echo hi - [x]
sky launch -c test-launch --cloud gcp --gpus tpu-v2-8 echo hi - [x]
sky launch -c test-launch --cloud gcp test.yaml(TPU VM: tpu-v3-32 for TPU pod test) - [x]
sky launch -c test-launch --cloud azure --num-nodes 2 --cpus 2+ echo hi - [x]
sky launch -c test-launch --cloud ibm --num-nodes 2 --cpus 2+ echo hi - [ ]
sky launch -c test-launch --cloud lambda --num-nodes 2 --cpus 2+ echo hi
- [x]
- [x] All smoke tests:
pytest tests/test_smoke.py - [x] All smoke tests:
pytest tests/test_smoke.py --aws - [x] Backward compatibility tests:
bash tests/backward_comaptibility_tests.sh- [x] AWS
- [x] GCP
- [x] existing spot controller
- [x] existing serve controller
Will our ray (w/ v 2.7.0) use the same ray address & port as the user-installed ray? If this is the case, will there be any compatibility issues when two different versions of ray act on the same interface? (I assume it is okay since ray should have backward compatibility, but just want to make sure.)
We are using the same version of ray in the new venv, and the ray address and port have been changed to non-default one in a previous PR already #1790.
Can we consider adding the
constants.ACTIVATE_PYTHON_ENVto SSHCommandRunner.run? It seems like will make the code cleaner.
This is a good point! However, the activate command needs to be called case by case and some commands are not directly invoked by SSHCommandRunner.run, e.g.:
- https://github.com/skypilot-org/skypilot/blob/27dc07bf1bdaeeebe82db80961d2153b0d6a4d9f/sky/backends/cloud_vm_ray_backend.py#L3357-L3358
- https://github.com/skypilot-org/skypilot/blob/27dc07bf1bdaeeebe82db80961d2153b0d6a4d9f/sky/skylet/attempt_skylet.py#L21
It might not be worth it to increase the complexity of the
SSHCommandRunner.runcommand. Wdyt?
Will our ray (w/ v 2.7.0) use the same ray address & port as the user-installed ray? If this is the case, will there be any compatibility issues when two different versions of ray act on the same interface? (I assume it is okay since ray should have backward compatibility, but just want to make sure.)
We are using the same version of ray in the new venv, and the ray address and port have been changed to non-default one in a previous PR already #1790.
Can we consider adding the
constants.ACTIVATE_PYTHON_ENVto SSHCommandRunner.run? It seems like will make the code cleaner.This is a good point! However, the activate command needs to be called case by case and some commands are not directly invoked by
SSHCommandRunner.run, e.g.:
https://github.com/skypilot-org/skypilot/blob/27dc07bf1bdaeeebe82db80961d2153b0d6a4d9f/sky/backends/cloud_vm_ray_backend.py#L3357-L3358
https://github.com/skypilot-org/skypilot/blob/27dc07bf1bdaeeebe82db80961d2153b0d6a4d9f/sky/skylet/attempt_skylet.py#L21
It might not be worth it to increase the complexity of the
SSHCommandRunner.runcommand. Wdyt?
Make sense! Lets keep current implementationš«”
Is there a way to disable this feature (via the Python API) to produce the existing behavior? We use SkyPilot within the code running on the launched cluster, so this isolation is both not needed and even increases our start time further (because we then need to install SkyPilot and start Ray again in our own environment).
Is there a way to disable this feature (via the Python API) to produce the existing behavior? We use SkyPilot within the code running on the launched cluster, so this isolation is both not needed and even increases our start time further (because we then need to install SkyPilot and start Ray again in our own environment).
Thanks for the request @dongreenberg! Is it possible to activate the environment source ~/skypilot-runtime/bin/activate in your code as well? I suppose this will not increase the start time : )
That would work! Will the command runner allow specifying a conda environment to run in?
That would work! Will the command runner allow specifying a conda environment to run in?
Currently, we decide not to have the argument for the command runner to make the API cleaner, and make the commands to be run self contained. Do you think it would be possible to do something as the following instead?
def custom_run(runner, cmd, *args, **kwargs):
cmd = f'source ~/skypilot-runtime/bin/activate; {cmd}'
return runner.run(cmd, *args, **kwargs)
Another user requested for this PR:
it would be great if sky's core setup is less easily interfered by user's actions
master (823999a)
2m2.152s
1m54.330s
1m57.132s
1m56.134s
1m49.895s
mean: 1m55.9286s
ae84211:
for i in {1..5}; do time sky launch -c test-gcp-$i --cloud gcp --cpus 2 -y; done
2m27.360s
2m19.898s
2m27.360s
2m33.147s
2m22.630s
mean: 2m26.079s (overhead: 30s)
Closing this, as it is moved to #3575