[Core] User setup causing SkyPilot runtime to fail

Open Michaelvll opened this issue 1 year ago • 4 comments

The following task.yaml can cause job submission to fail when launched with sky launch -c test task.yaml:

Reproduction

resources:
  cloud: aws
  disk_size: 256

num_nodes: 2

setup: |
  set -ex
  echo "setup stage begin"
  pip install --no-input img2dataset

run: |
  set -ex
  echo "run stage begin"

Reason

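After logging into the cluster, the issue appears to be that installing img2dataset changes the numpy/pyarrow versions in the base Python environment, which somehow breaks skypilot-runtime even though it lives in a separate Python venv. In the failed job below, ray is loaded from the skypilot-runtime venv, but pyarrow is imported from /opt/conda: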

ssh test
source ~/skypilot-runtime/bin/activate
ray job list
 JobDetails(type=<JobType.SUBMISSION: 'SUBMISSION'>, job_id=None, submission_id='5-ubuntu', driver_info=None, status=<JobStatus.FAILED: 'FAILED'>, entrypoint='/home/ubuntu/skypilot-runtime/bin/python -u ~/.sky/sky_app/sky_job_5 > ~/sky_logs/sky-2024-10-16-22-16-21-150087/run.log 2> /dev/null', message='Unexpected error occurred: The actor died because of an error raised in its creation task, \x1b[36mray::_ray_internal_job_actor_5-ubuntu:JobSupervisor.__init__()\x1b[39m (pid=4560, ip=172.31.83.142, actor_id=b812234ae2b45cf7b4ee51d501000000, repr=<ray.dashboard.modules.job.job_manager.JobSupervisor object at 0x7b86d8213250>)\n  File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 451, in result\n    return self.__get_result()\n  File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result\n    raise self._exception\n  File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/ray/util/serialization_addons.py", line 39, in apply\n    _register_custom_datasets_serializers(serialization_context)\n  File "/opt/conda/lib/python3.10/site-packages/pyarrow/__init__.py", line 65, in <module>\n    import pyarrow.lib as _lib\n  File "pyarrow/lib.pyx", line 36, in init pyarrow.lib\nImportError: numpy.core.multiarray failed to import', error_type=None, start_time=1729117024917, end_time=1729117026116, metadata={}, runtime_env={}, driver_agent_http_address=None, driver_node_id=None, driver_exit_code=None)]
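To check which environment a package is actually imported from, a small helper like the following can be run with the skypilot-runtime interpreter on the cluster. This is only a diagnostic sketch: the function names `module_origin` and `resolves_inside_env` are made up for illustration and are not part of SkyPilot or Ray.

```python
import importlib.util
import sys
from pathlib import Path


def module_origin(name):
    """Return the file a module would be imported from, or None.

    None is returned for missing modules and for built-in/frozen
    modules that have no file location.
    """
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.has_location or spec.origin is None:
        return None
    return Path(spec.origin)


def resolves_inside_env(name, prefix=sys.prefix):
    """Return True if `name` is found under `prefix`, False if it is
    found elsewhere, and None if it cannot be located at all."""
    origin = module_origin(name)
    if origin is None:
        return None
    try:
        origin.resolve().relative_to(Path(prefix).resolve())
        return True
    except ValueError:
        # The module file lives outside the given prefix.
        return False


def report(names):
    """Print where each module comes from relative to the active env."""
    for name in names:
        print(name, module_origin(name), resolves_inside_env(name))
```

Calling `report(["ray", "pyarrow", "numpy"])` inside the activated venv would show whether each package comes from the venv itself or from somewhere outside it, such as the base environment.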

Potential fixes

We may need to be careful with the --system-site-packages option used when creating the skypilot-runtime venv, since package changes in the base environment can then affect the SkyPilot runtime as well.

https://github.com/skypilot-org/skypilot/blob/53380e26f01452559012d57b333b17f40dd8a4d1/sky/skylet/constants.py#L158

I tested removing this argument from the skypilot-runtime setup, and the problem seems to go away. We should avoid this argument in our hosted image (cc'ing @yika-luo) and decide whether to drop it for custom images as well (this may lengthen provisioning considerably, since more packages would have to be installed instead of reusing the ones already on the system).

Michaelvll avatar Oct 16 '24 22:10 Michaelvll

Testing the impact on provisioning time now

yika-luo avatar Oct 16 '24 22:10 yika-luo

Testing the impact on provisioning time now

I suppose we should avoid this argument in our host image creation, i.e. the packer file. In that case, it should not affect the provisioning time?

Michaelvll avatar Oct 16 '24 22:10 Michaelvll

The latest custom images don't use --system-site-packages. Also tested the example yaml, and it works fine.

yika-luo avatar Oct 22 '24 21:10 yika-luo

This issue happens because a virtual environment created with --system-site-packages does not get its own copy of the system packages. Instead, it adds the global (system) site-packages directory to its import path, so any package changed there is also visible inside the venv.

So we should get rid of --system-site-packages in our sky launch setup command and explicitly install the packages we need. Will test this out.
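As a rough way to check whether an existing venv was created with this option, the setting is recorded in the venv's pyvenv.cfg file under the include-system-site-packages key. The sketch below uses only the standard library; the helper names are hypothetical, and the ~/skypilot-runtime path in the usage note is taken from the reproduction above.

```python
import tempfile
import venv
from pathlib import Path


def includes_system_site_packages(env_dir):
    """Read pyvenv.cfg and report whether the venv inherits the
    system site-packages directory."""
    cfg = Path(env_dir).expanduser() / "pyvenv.cfg"
    for line in cfg.read_text().splitlines():
        key, _, value = line.partition("=")
        if key.strip() == "include-system-site-packages":
            return value.strip().lower() == "true"
    # Key missing: treat the venv as isolated.
    return False


def build_and_check(system_site_packages):
    """Create a throwaway venv (without pip, to keep it fast) and
    return what its pyvenv.cfg says about system site-packages."""
    with tempfile.TemporaryDirectory() as tmp:
        env_dir = Path(tmp) / "env"
        builder = venv.EnvBuilder(
            system_site_packages=system_site_packages, with_pip=False
        )
        builder.create(env_dir)
        return includes_system_site_packages(env_dir)
```

On the cluster, `includes_system_site_packages("~/skypilot-runtime")` should indicate whether the runtime venv still inherits packages from the base environment, and `build_and_check(True)` versus `build_and_check(False)` can be used to compare the two modes locally.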

yika-luo avatar Oct 24 '24 22:10 yika-luo