skypilot icon indicating copy to clipboard operation
skypilot copied to clipboard

[Core] Install SkyPilot runtime in separate env

Open Michaelvll opened this issue 2 years ago • 8 comments

Fixes #2722

Background

Installing every skypilot runtime in the base environment can cause issues when users also tries to install their dependencies in the base environment on the remote VM. It is very easy to cause the VM to become in a unexpected state. We now move all the skypilot runtime to a new conda environment for better isolation from the users' own python environment.

Profiling

(average across three times) sky launch -c test-aws --cloud aws --cpus 2

  • Original: 3m2s
  • conda implementation: 3m16s
  • venv implementation (current implementation): 2m42s

(average across 5 times) for i in {1..5}; do time sky launch -c test-aws-$i --cloud aws --cpus 2 -y; done

  • master (conda base, 7c514ba): 2m23.9s
    2m34.249s
    2m30.460s
    2m29.660s
    2m14.271s
    2m11.077s
    
  • venv implementation (current, 841f1ff): 2m39.8s (16s longer)
    3m0.389s
    2m56.428s
    2m37.248s
    2m16.727s
    2m28.398s
    

sky launch -c test-gcp --cloud gcp --cpus 2

  • Original: 2m50s
  • conda implementation: 3m38s
  • venv implementation (current implementation): 3m3s

(average across 5 times) for i in {1..5}; do time sky launch -c test-gcp-$i --cloud gcp --cpus 2 -y; done

  • master (conda base, 7c514ba): 2m43.3s
    2m41.862s
    2m56.141s
    2m39.850s
    2m41.099s
    2m37.994s
    
  • venv implementation (current, 841f1ff): 3m10.4s (27s longer)
    3m20.109s
    3m9.852s
    3m2.220s
    3m17.485s
    3m2.733s
    

TODOs

  • [x] Make other clouds work with the new separate venv
  • [x] Backward compatibility test
  • [ ] Reduce the time for provisioning

Tested (run the relevant ones):

  • [ ] Code formatting: bash format.sh
  • [ ] Any manual or new tests for this PR (please specify below)
    • [x] sky launch -c test-launch --cloud aws --num-nodes 2 --cpus 2+ echo hi
    • [x] sky launch -c test-launch --cloud gcp --num-nodes 2 --cpus 2+ echo hi
    • [x] sky launch -c test-launch --cloud gcp --gpus tpu-v2-8 echo hi
    • [x] sky launch -c test-launch --cloud gcp test.yaml (TPU VM: tpu-v3-32 for TPU pod test)
    • [x] sky launch -c test-launch --cloud azure --num-nodes 2 --cpus 2+ echo hi
    • [x] sky launch -c test-launch --cloud ibm --num-nodes 2 --cpus 2+ echo hi
    • [ ] sky launch -c test-launch --cloud lambda --num-nodes 2 --cpus 2+ echo hi
  • [x] All smoke tests: pytest tests/test_smoke.py
  • [x] All smoke tests: pytest tests/test_smoke.py --aws
  • [x] Backward compatibility tests: bash tests/backward_comaptibility_tests.sh
    • [x] AWS
    • [x] GCP
    • [x] existing spot controller
    • [x] existing serve controller

Michaelvll avatar Nov 17 '23 23:11 Michaelvll

Will our ray (w/ v 2.7.0) use the same ray address & port as the user-installed ray? If this is the case, will there be any compatibility issues when two different versions of ray act on the same interface? (I assume it is okay since ray should have backward compatibility, but just want to make sure.)

We are using the same version of ray in the new venv, and the ray address and port have been changed to non-default one in a previous PR already #1790.

Can we consider adding the constants.ACTIVATE_PYTHON_ENV to SSHCommandRunner.run? It seems like will make the code cleaner.

This is a good point! However, the activate command needs to be called case by case and some commands are not directly invoked by SSHCommandRunner.run, e.g.:

  1. https://github.com/skypilot-org/skypilot/blob/27dc07bf1bdaeeebe82db80961d2153b0d6a4d9f/sky/backends/cloud_vm_ray_backend.py#L3357-L3358
  2. https://github.com/skypilot-org/skypilot/blob/27dc07bf1bdaeeebe82db80961d2153b0d6a4d9f/sky/skylet/attempt_skylet.py#L21 It might not be worth it to increase the complexity of the SSHCommandRunner.run command. Wdyt?

Michaelvll avatar Nov 20 '23 06:11 Michaelvll

Will our ray (w/ v 2.7.0) use the same ray address & port as the user-installed ray? If this is the case, will there be any compatibility issues when two different versions of ray act on the same interface? (I assume it is okay since ray should have backward compatibility, but just want to make sure.)

We are using the same version of ray in the new venv, and the ray address and port have been changed to non-default one in a previous PR already #1790.

Can we consider adding the constants.ACTIVATE_PYTHON_ENV to SSHCommandRunner.run? It seems like will make the code cleaner.

This is a good point! However, the activate command needs to be called case by case and some commands are not directly invoked by SSHCommandRunner.run, e.g.:

  1. https://github.com/skypilot-org/skypilot/blob/27dc07bf1bdaeeebe82db80961d2153b0d6a4d9f/sky/backends/cloud_vm_ray_backend.py#L3357-L3358

  2. https://github.com/skypilot-org/skypilot/blob/27dc07bf1bdaeeebe82db80961d2153b0d6a4d9f/sky/skylet/attempt_skylet.py#L21

    It might not be worth it to increase the complexity of the SSHCommandRunner.run command. Wdyt?

Make sense! Lets keep current implementation🫔

cblmemo avatar Nov 20 '23 06:11 cblmemo

Is there a way to disable this feature (via the Python API) to produce the existing behavior? We use SkyPilot within the code running on the launched cluster, so this isolation is both not needed and even increases our start time further (because we then need to install SkyPilot and start Ray again in our own environment).

dongreenberg avatar Nov 21 '23 19:11 dongreenberg

Is there a way to disable this feature (via the Python API) to produce the existing behavior? We use SkyPilot within the code running on the launched cluster, so this isolation is both not needed and even increases our start time further (because we then need to install SkyPilot and start Ray again in our own environment).

Thanks for the request @dongreenberg! Is it possible to activate the environment source ~/skypilot-runtime/bin/activate in your code as well? I suppose this will not increase the start time : )

Michaelvll avatar Nov 22 '23 23:11 Michaelvll

That would work! Will the command runner allow specifying a conda environment to run in?

dongreenberg avatar Nov 23 '23 00:11 dongreenberg

That would work! Will the command runner allow specifying a conda environment to run in?

Currently, we decide not to have the argument for the command runner to make the API cleaner, and make the commands to be run self contained. Do you think it would be possible to do something as the following instead?

def custom_run(runner, cmd, *args, **kwargs):
    cmd = f'source ~/skypilot-runtime/bin/activate; {cmd}'
    return runner.run(cmd, *args, **kwargs)

Michaelvll avatar Nov 23 '23 09:11 Michaelvll

Another user requested for this PR:

it would be great if sky's core setup is less easily interfered by user's actions

Michaelvll avatar Dec 24 '23 16:12 Michaelvll

master (823999a)

2m2.152s
1m54.330s
1m57.132s
1m56.134s
1m49.895s

mean: 1m55.9286s

ae84211: for i in {1..5}; do time sky launch -c test-gcp-$i --cloud gcp --cpus 2 -y; done

2m27.360s
2m19.898s
2m27.360s
2m33.147s
2m22.630s

mean: 2m26.079s (overhead: 30s)

Michaelvll avatar Mar 18 '24 07:03 Michaelvll

Closing this, as it is moved to #3575

Michaelvll avatar May 21 '24 18:05 Michaelvll