skypilot icon indicating copy to clipboard operation
skypilot copied to clipboard

ECR `image_id` pull fails with command-substituted `SKYPILOT_DOCKER_PASSWORD` due to dependency timing (Blocks PR #4871 use case)

Open andylizf opened this issue 7 months ago • 4 comments

The functionality introduced in PR #4871, which supports command substitution in SKYPILOT_DOCKER_PASSWORD (e.g., $(aws ecr get-login-password ...)), cannot be effectively used when resources.image_id points to an ECR image. The initial docker pull of this ECR image on the provisioned VM fails due to a dependency timing conflict.

Problem Details:

  1. SkyPilot provisions a base VM.
  2. When PR #4871's feature is used, SkyPilot attempts docker login by executing the command in SKYPILOT_DOCKER_PASSWORD, followed by docker pull for the specified ECR image_id. This happens on the newly provisioned VM.
  3. The critical issue is that SkyPilot's standard procedure for copying necessary user credentials, files, or tools (e.g., ~/.aws/credentials, the AWS CLI itself if not pre-installed, or custom scripts which the password command might be) typically happens in a later provisioning stage.
  4. Consequently, the command in SKYPILOT_DOCKER_PASSWORD executes before its required dependencies are available on the VM. This leads to command failure, docker login failure, and ultimately, the ECR image_id pull fails.

Impact on PR #4871: This timing conflict is a significant blocker for using PR #4871's command substitution feature for a primary use case: dynamically authenticating to ECR (or similar registries) when the ECR image itself is the runtime environment specified via image_id. While the substitution mechanism itself might be in place, it fails in practice for image_id scenarios due to this provisioning order.

Expected Behavior/Resolution Path: To enable PR #4871 for ECR image_id use cases, the dependencies (tools, configs, scripts) required by the SKYPILOT_DOCKER_PASSWORD command must be present and configured on the VM at the moment docker login is attempted for the initial image_id pull.

Potential solutions could involve:

  • Modifying SkyPilot's provisioning sequence to ensure these specific dependencies are deployed to the VM before the docker login for image_id is attempted.
  • Introducing a dedicated pre-flight mechanism for staging just these critical Docker authentication dependencies.

Addressing this is essential for PR #4871 to successfully support dynamic ECR authentication for containerized runtime environments.

andylizf avatar May 22 '25 16:05 andylizf

CC'ing @cblmemo here

andylizf avatar May 22 '25 16:05 andylizf

Both sounds a little bit hacky to me. Lets defer this issue for now.

cblmemo avatar Jun 02 '25 22:06 cblmemo

In our project we handle ECR auth with this patch:

diff -ur sky/provision/docker_utils.py sky-patched/provision/docker_utils.py
--- sky/provision/docker_utils.py	2025-06-02 23:54:17
+++ sky-patched/provision/docker_utils.py	2025-06-16 15:57:46
@@ -234,6 +234,23 @@
                     f'--password {shlex.quote(docker_login_config.password)} '
                     f'{shlex.quote(docker_login_config.server)}',
                     wait_for_docker_daemon=True)
+            elif docker_login_config.server.endswith(".amazonaws.com") and ".ecr." in docker_login_config.server:
+                # Handle AWS ECR authentication
+                server = docker_login_config.server
+                if server.endswith(".amazonaws.com") and ".ecr." in server:
+                    self._run('sudo apt install -y amazon-ecr-credential-helper')
+                    # Create the docker config directory if it doesn't exist
+                    self._run('mkdir -p ~/.docker')
+                    # Create or update the docker config.json file with ECR credentials helper
+                    config_json = {
+                        "credHelpers": {
+                            server: "ecr-login"
+                        }
+                    }
+                    # Write the config to a temporary file and move it to the right location
+                    import json
+                    self._run(f'echo \'{json.dumps(config_json, indent=2)}\' > /tmp/docker_config.json && '
+                              f'mv /tmp/docker_config.json ~/.docker/config.json')
             elif docker_login_config.server.endswith('-docker.pkg.dev'):
                 # Docker image server is on GCR, we need to do additional setup
                 # to pull the image.
diff -ur sky/resources.py sky-patched/resources.py
--- sky/resources.py	2025-06-16 15:54:40
+++ sky-patched/resources.py	2025-06-16 15:57:08
@@ -1699,9 +1699,6 @@
         add_if_not_none('labels', self.labels)
         if self._autostop_config is not None:
             config['autostop'] = self._autostop_config.to_yaml_config()
-        if self._docker_login_config is not None:
-            config['_docker_login_config'] = dataclasses.asdict(
-                self._docker_login_config)
         if self._docker_username_for_runpod is not None:
             config['_docker_username_for_runpod'] = (
                 self._docker_username_for_runpod)

berekuk avatar Jun 19 '25 23:06 berekuk

Thanks for sharing @berekuk -- would you like to submit a PR?

concretevitamin avatar Jun 20 '25 13:06 concretevitamin

Thanks to @berekuk. We recently have a similar implementation in 1b3287a. Feel free to adopt our next release and remove this patch.

andylizf avatar Sep 17 '25 00:09 andylizf