[Core] User setup causing SkyPilot runtime to fail
The following `task.yaml` can cause job submission to fail when launched with `sky launch -c test task.yaml`.

### Reproduction
```yaml
resources:
  cloud: aws
  disk_size: 256

num_nodes: 2

setup: |
  set -ex
  echo "setup stage begin"
  pip install --no-input img2dataset

run: |
  set -ex
  echo "run stage begin"
```
### Reason

After logging into the cluster, the issue appears to be caused by installing `img2dataset`, which changes the numpy/pyarrow versions in the base Python environment; this somehow breaks the SkyPilot runtime even though it lives in a separate Python venv.
```shell
ssh test
source ~/skypilot-runtime/bin/activate
ray job list
```

```
JobDetails(type=<JobType.SUBMISSION: 'SUBMISSION'>, job_id=None, submission_id='5-ubuntu', driver_info=None, status=<JobStatus.FAILED: 'FAILED'>, entrypoint='/home/ubuntu/skypilot-runtime/bin/python -u ~/.sky/sky_app/sky_job_5 > ~/sky_logs/sky-2024-10-16-22-16-21-150087/run.log 2> /dev/null', message='Unexpected error occurred: The actor died because of an error raised in its creation task, \x1b[36mray::_ray_internal_job_actor_5-ubuntu:JobSupervisor.__init__()\x1b[39m (pid=4560, ip=172.31.83.142, actor_id=b812234ae2b45cf7b4ee51d501000000, repr=<ray.dashboard.modules.job.job_manager.JobSupervisor object at 0x7b86d8213250>)\n File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 451, in result\n return self.__get_result()\n File "/opt/conda/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result\n raise self._exception\n File "/home/ubuntu/skypilot-runtime/lib/python3.10/site-packages/ray/util/serialization_addons.py", line 39, in apply\n _register_custom_datasets_serializers(serialization_context)\n File "/opt/conda/lib/python3.10/site-packages/pyarrow/__init__.py", line 65, in <module>\n import pyarrow.lib as _lib\n File "pyarrow/lib.pyx", line 36, in init pyarrow.lib\nImportError: numpy.core.multiarray failed to import', error_type=None, start_time=1729117024917, end_time=1729117026116, metadata={}, runtime_env={}, driver_agent_http_address=None, driver_node_id=None, driver_exit_code=None)]
```
### Potential fixes

We may need to be careful with the `--system-site-packages` option used when creating the venv in our skypilot-runtime setup, since packages changed in the base environment can affect the SkyPilot runtime as well:
https://github.com/skypilot-org/skypilot/blob/53380e26f01452559012d57b333b17f40dd8a4d1/sky/skylet/constants.py#L158
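For illustration (the paths are hypothetical, not SkyPilot's actual layout), the flag's effect is recorded directly in the venv's `pyvenv.cfg`, which controls whether imports fall through to the system site-packages:

```shell
# Create one isolated venv and one that inherits system site-packages.
python3 -m venv /tmp/iso_venv
python3 -m venv --system-site-packages /tmp/shared_venv

# pyvenv.cfg records the behavior; the shared venv resolves imports from
# the system environment whenever a package is absent from the venv itself.
grep include-system-site-packages /tmp/iso_venv/pyvenv.cfg
# include-system-site-packages = false
grep include-system-site-packages /tmp/shared_venv/pyvenv.cfg
# include-system-site-packages = true
```

This is why a `pip install` in the base environment (here, `img2dataset` pulling in a different numpy) can change what the runtime venv imports.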
Tested with that argument removed from the skypilot-runtime setup, and the problem goes away. We should avoid this argument in our hosted image (cc @yika-luo) and decide whether to drop it for custom images as well (doing so may significantly lengthen provisioning, since more packages have to be installed instead of reusing the existing system ones).
Testing the impact on provisioning time now
I suppose we should avoid this argument in our hosted image creation, i.e. the Packer file. In that case, it should not affect provisioning time, right?
The latest custom images don't use `--system-site-packages`. Also tested the example YAML and it works fine.
This issue happens because a virtual environment created with `--system-site-packages` is not a copy of the system packages; instead, it links to and inherits from the global (system) site-packages directory.

So we should get rid of `--system-site-packages` in our `sky launch` setup command and explicitly install what we need. Will test this out.