
[feature-request] Support installing dependencies in requirements.txt from CodeArtifact for both training/inference SageMaker Containers

Open humanzz opened this issue 1 year ago • 10 comments

Checklist

  • [ ] I've prepended issue tag with type of change: [feature]
  • [ ] (If applicable) I've documented below the DLC image/dockerfile this relates to
  • [ ] (If applicable) I've documented the tests I've run on the DLC image
  • [ ] I'm using an existing DLC image listed here: https://docs.aws.amazon.com/deep-learning-containers/latest/devguide/deep-learning-containers-images.html
  • [ ] I've built my own container based off DLC (and I've attached the code used to build my own image)

Concise Description:

For training/inference containers that support installing additional dependencies via requirements.txt, allow passing the parameters needed to install those dependencies from a CodeArtifact repository instead of the public PyPI index.

Is your feature request related to a problem? Please describe.

With security policies requiring running training jobs/endpoints in an internet-isolated VPC, leveraging requirements.txt to install additional dependencies on training/inference containers is not possible. Being able to leverage CodeArtifact - rather than pypi public index - would allow users of requirements.txt to adhere to security best practices to isolate their training/inference runtimes from the internet.

Describe the solution you'd like

  • Leverage the ability to set environment variables when creating a training job or a model to pass environment variables indicating which CodeArtifact repository to use (domain, domain owner, repository)
  • https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateTrainingJob.html
  • https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateModel.html
  • https://docs.aws.amazon.com/codeartifact/latest/ug/python-configure-pip.html
  • If the environment variables are set, the container configures pip to use CodeArtifact before installing the dependencies in requirements.txt. Otherwise, it uses the public PyPI index as usual.
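The proposed behaviour can be sketched roughly as follows. This is only an illustration of the idea, not the toolkits' implementation: the variable names CA_DOMAIN, CA_DOMAIN_OWNER, CA_REGION and CA_REPOSITORY are hypothetical, and the authorization token is assumed to have been fetched elsewhere (for example via CodeArtifact's GetAuthorizationToken). The index URL follows the format described in the CodeArtifact pip documentation linked above.

```python
from typing import List, Mapping, Optional

# Hypothetical variable names, used only for this sketch.
CA_KEYS = ("CA_DOMAIN", "CA_DOMAIN_OWNER", "CA_REGION", "CA_REPOSITORY")


def codeartifact_index_url(domain: str, owner: str, region: str,
                           repository: str, token: str) -> str:
    """Build a pip index URL in the format documented for CodeArtifact."""
    return (
        f"https://aws:{token}@{domain}-{owner}.d.codeartifact."
        f"{region}.amazonaws.com/pypi/{repository}/simple/"
    )


def pip_install_command(env: Mapping[str, str],
                        token: Optional[str] = None,
                        requirements: str = "requirements.txt") -> List[str]:
    """Return the pip command, pointing at CodeArtifact only when configured."""
    cmd = ["python", "-m", "pip", "install", "-r", requirements]
    if token and all(env.get(key) for key in CA_KEYS):
        url = codeartifact_index_url(
            env["CA_DOMAIN"], env["CA_DOMAIN_OWNER"],
            env["CA_REGION"], env["CA_REPOSITORY"], token,
        )
        cmd += ["--index-url", url]
    # Without the variables (or a token), pip falls back to its default index.
    return cmd
```

In a container, `env` would be `os.environ` and the resulting list could be handed to `subprocess.run`.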

Describe alternatives you've considered

N/A

Additional context

  • Here's a blog post https://aws.amazon.com/blogs/machine-learning/secure-aws-codeartifact-access-for-isolated-amazon-sagemaker-notebook-instances/ describing some of the benefits of using CodeArtifact - but in SageMaker Notebooks that are Internet-isolated.
  • Similarly, running training jobs/deploying endpoints in an isolated VPC rules out the use of requirements.txt. Allowing CodeArtifact configuration to be passed in, and having containers use that configuration to install the dependencies in requirements.txt from a CodeArtifact repository, would be an ideal solution.
  • There are 2 related feature requests made to sagemaker-training-toolkit/sagemaker-inference-toolkit at https://github.com/aws/sagemaker-training-toolkit/issues/167 and https://github.com/aws/sagemaker-inference-toolkit/issues/85

humanzz avatar Dec 13 '22 20:12 humanzz

I've submitted 2 near-identical PRs to both sagemaker-training-toolkit and sagemaker-inference-toolkit at

  • https://github.com/aws/sagemaker-inference-toolkit/pull/130
  • https://github.com/aws/sagemaker-training-toolkit/pull/187

My understanding is that if these get merged, then new containers leveraging those packages should start having CodeArtifact support

humanzz avatar Jul 10 '23 14:07 humanzz

I've also submitted a PR for sagemaker-pytorch-inference-toolkit at https://github.com/aws/sagemaker-pytorch-inference-toolkit/pull/150

humanzz avatar Jul 17 '23 18:07 humanzz

Inference-side changes have been merged

  • https://github.com/aws/sagemaker-inference-toolkit/pull/130
  • https://github.com/aws/sagemaker-pytorch-inference-toolkit/pull/150

humanzz avatar Jul 24 '23 18:07 humanzz

Inference-side changes have been released at

  • https://github.com/aws/sagemaker-inference-toolkit/releases/tag/v1.10.0
  • https://github.com/aws/sagemaker-pytorch-inference-toolkit/releases/tag/v2.0.16

At the time of writing this comment, it seems that the PyTorch inference containers use sagemaker-pytorch-inference-toolkit==2.0.14 as per https://github.com/search?q=repo%3Aaws%2Fdeep-learning-containers%20SM_TOOLKIT_VERSION&type=code

For PyTorch inference containers to pick up CodeArtifact support, they need to move to sagemaker-pytorch-inference-toolkit >= 2.0.16

humanzz avatar Aug 02 '23 19:08 humanzz

All of the above PRs, coupled with the release of new container versions that include the updated package versions, provide CodeArtifact support when the environment variable CA_REPOSITORY_ARN is set to the ARN of the desired CodeArtifact repository.
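As a rough illustration of what setting CA_REPOSITORY_ARN involves, the sketch below validates a repository ARN and builds the environment mapping to pass when creating a training job or model. The helper names are hypothetical and this is not the toolkits' own parsing code; it assumes the standard repository ARN shape `arn:aws:codeartifact:<region>:<account>:repository/<domain>/<repository>`.

```python
from typing import Dict, NamedTuple


class CodeArtifactRepo(NamedTuple):
    region: str
    account: str
    domain: str
    repository: str


def parse_repository_arn(arn: str) -> CodeArtifactRepo:
    """Split a CodeArtifact repository ARN into its components."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "codeartifact":
        raise ValueError(f"not a CodeArtifact ARN: {arn!r}")
    resource = parts[5].split("/")
    if len(resource) != 3 or resource[0] != "repository":
        raise ValueError(f"not a repository ARN: {arn!r}")
    return CodeArtifactRepo(
        region=parts[3], account=parts[4],
        domain=resource[1], repository=resource[2],
    )


def container_environment(arn: str) -> Dict[str, str]:
    """Environment mapping to pass to the container (validated first)."""
    parse_repository_arn(arn)  # raises ValueError on a malformed ARN
    return {"CA_REPOSITORY_ARN": arn}
```

The returned mapping would go into the `Environment` field of a CreateTrainingJob request, or into `PrimaryContainer.Environment` of a CreateModel request.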

Using this feature also requires updating the IAM policies

  1. SageMaker Execution Role would need to be updated to permit access to CodeArtifact
  2. CodeArtifact repository resource policy might also require updates

SageMaker Execution Role example policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "codeartifact:GetAuthorizationToken",
                "codeartifact:GetRepositoryEndpoint",
                "codeartifact:ReadFromRepository"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "sts:GetServiceBearerToken",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "sts:AWSServiceName": "codeartifact.amazonaws.com"
                }
            }
        }
    ]
}
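The example above grants access to every CodeArtifact resource. If a narrower policy is preferred, something like the following sketch could generate one for a single repository. The function name is hypothetical, and the resource scoping is an assumption to verify against the IAM documentation for CodeArtifact: it assumes GetAuthorizationToken is authorized against the domain ARN and the other two actions against the repository ARN.

```python
import json
from typing import Any, Dict


def scoped_execution_role_policy(region: str, account: str,
                                 domain: str, repository: str) -> Dict[str, Any]:
    """Build an execution-role policy limited to one domain and repository."""
    prefix = f"arn:aws:codeartifact:{region}:{account}"
    domain_arn = f"{prefix}:domain/{domain}"
    repo_arn = f"{prefix}:repository/{domain}/{repository}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # assumed to be a domain-level action
                "Effect": "Allow",
                "Action": "codeartifact:GetAuthorizationToken",
                "Resource": domain_arn,
            },
            {   # assumed to be repository-level actions
                "Effect": "Allow",
                "Action": [
                    "codeartifact:GetRepositoryEndpoint",
                    "codeartifact:ReadFromRepository",
                ],
                "Resource": repo_arn,
            },
            {   # unchanged from the example policy above
                "Effect": "Allow",
                "Action": "sts:GetServiceBearerToken",
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        "sts:AWSServiceName": "codeartifact.amazonaws.com"
                    }
                },
            },
        ],
    }


if __name__ == "__main__":
    policy = scoped_execution_role_policy(
        "us-west-2", "123456789012", "my-domain", "my-repo")
    print(json.dumps(policy, indent=4))
```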

CodeArtifact repository example resource policy to permit the above role from account 123456789012

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "codeartifact:DescribePackageVersion",
                "codeartifact:DescribeRepository",
                "codeartifact:GetPackageVersionReadme",
                "codeartifact:GetRepositoryEndpoint",
                "codeartifact:ListPackages",
                "codeartifact:ListPackageVersions",
                "codeartifact:ListPackageVersionAssets",
                "codeartifact:ListPackageVersionDependencies",
                "codeartifact:ReadFromRepository"
            ],
            "Effect": "Allow",
            "Principal": {
                 "AWS": "arn:aws:iam::123456789012:root"
            },
            "Resource": "*"
        }
    ]
}

humanzz avatar Aug 07 '23 10:08 humanzz

Training-side changes have been released at

  • https://github.com/aws/sagemaker-training-toolkit/pull/187
  • https://github.com/aws/sagemaker-training-toolkit/releases/tag/v4.7.0

humanzz avatar Aug 08 '23 17:08 humanzz

This means that the remaining work is more or less contained within this repo: updating and releasing new container versions

  • release of new training container versions to leverage sagemaker-training>=4.7.0 (most docker files would allow that)
  • release of new inference containers to leverage sagemaker-inference>=1.10.0
  • merge of https://github.com/aws/deep-learning-containers/pull/3227 so they leverage sagemaker-pytorch-inference>=2.0.16

humanzz avatar Aug 08 '23 17:08 humanzz

SageMaker PyTorch 2.0.1 inference containers now support CodeArtifact

  • https://github.com/aws/deep-learning-containers/pull/3227 has been merged
  • A new version of the 2.0.1 containers has been released e.g. https://github.com/aws/deep-learning-containers/releases/tag/v1.6-pt-sagemaker-2.0.1-inf-cpu-py310

For training, this is likely to happen after

  • https://github.com/aws/deep-learning-containers/pull/3172 is merged
  • New container versions are released (thereby taking advantage of https://github.com/aws/sagemaker-training-toolkit/pull/187)

humanzz avatar Aug 16 '23 10:08 humanzz

PyTorch 2.0.1 training images, with CodeArtifact support, have been released e.g. https://github.com/aws/deep-learning-containers/releases/tag/v1.3-pt-sagemaker-2.0.1-tr-cpu-py310

humanzz avatar Aug 29 '23 15:08 humanzz

Summary of the work to get this into PT 2.0.1 training/inference images, and to hopefully enable this to flow to more frameworks

flowchart TD
    inferencepr["fa:fa-code-pull-request feat: support codeartifact for installing requirements.txt packages sagemaker-inference-toolkit#130"] --> inferencerepo
    inferencerepo["fa:fa-code sagemaker-inference-toolkit"] --> inferencerelease
    inferencerelease["fa:fa-cube sagemaker-inference 1.10.0"]
    inferencerelease -.-> ptinferencerelease
    inferencerelease --> dlcptinferencerelease

    ptinferencepr["fa:fa-code-pull-request reuse sagemaker-inference's requirements.txt installation logic sagemaker-pytorch-inference-toolkit#150"] --> ptinferencerepo
    ptinferencerepo["fa:fa-code sagemaker-pytorch-inference-toolkit"] --> ptinferencerelease
    ptinferencerelease["fa:fa-cube sagemaker-pytorch-inference 2.0.16"]
    ptinferencerelease --> dlcptinferencerelease



    dlcpr["fa:fa-code-pull-request [PyTorch] Update sagemaker-pytorch-inference to 2.0.16 deep-learning-containers#3227"] --> dlcrepo
    dlcrepo["fa:fa-code deep-learning-containers"]
    dlcptinferencerelease["fa:fa-cube v1.6-pt-sagemaker-2.0.1-inf-cpu-py310"]
    dlcpttrainingrelease["fa:fa-cube v1.3-pt-sagemaker-2.0.1-tr-cpu-py310"]
    dlcrepo --> dlcptinferencerelease
    dlcrepo --> dlcpttrainingrelease
    dlcrepo ---> otherreleases



    trainingpr["fa:fa-code-pull-request feat: support codeartifact for installing requirements.txt packages sagemaker-training-toolkit#187"] --> trainingrepo
    trainingrepo["fa:fa-code sagemaker-training-toolkit"] --> trainingrelease
    trainingrelease["fa:fa-cube sagemaker-training 4.7.0"]
    trainingrelease --> dlcpttrainingrelease

    otherreleases["fa:fa-cube future image releases using sagemaker-inference>=1.10.0 and sagemaker-training>=4.7.0"]

humanzz avatar Aug 29 '23 18:08 humanzz