fix: add missing Tensorflow 2.9 inference image
Issue #, if available: #3250
Description of changes:
Add the missing Tensorflow 2.9 inference image.
Note that I'm considering this to be a "fix", not a "feature": the training image for 2.9 is already in and released but the inference image is missing, so I think this is a bug.
There might well be some automation or script that can do this, but I believe that I, as an end user, can't list all images in a repository, so this is based only on my finding that I can pull tensorflow-inference:2.9.0-cpu (and can't pull tensorflow-inference:2.9.1-cpu).
Testing done:
$ aws ecr get-login-password --region eu-west-1 | docker login --username AWS --password-stdin "763104351884.dkr.ecr.eu-west-1.amazonaws.com"
Login Succeeded
$ docker pull 763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:2.9.0-cpu
2.9.0-cpu: Pulling from tensorflow-inference
...
$ docker pull 763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:2.9-cpu
2.9-cpu: Pulling from tensorflow-inference
...
$ docker pull 763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:2.9.0-gpu
2.9.0-gpu: Pulling from tensorflow-inference
...
$ docker pull 763104351884.dkr.ecr.eu-west-1.amazonaws.com/tensorflow-inference:2.9-gpu
2.9-gpu: Pulling from tensorflow-inference
...
So 2.9 and 2.9.0 exist in both cpu and gpu formats, in the eu-west-1 registries, at least.
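The four pulls above follow one naming pattern. A minimal sketch that enumerates those image URIs, assuming the `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>` layout seen in the commands; the helper name `inference_image_uris` is hypothetical and nothing here contacts ECR:

```python
from itertools import product

ACCOUNT = "763104351884"  # registry account used in the commands above


def inference_image_uris(region, versions=("2.9.0", "2.9"), processors=("cpu", "gpu")):
    """Build the tensorflow-inference image URI for every version/processor pair."""
    registry = f"{ACCOUNT}.dkr.ecr.{region}.amazonaws.com"
    return [
        f"{registry}/tensorflow-inference:{version}-{processor}"
        for version, processor in product(versions, processors)
    ]


for uri in inference_image_uris("eu-west-1"):
    print(uri)
```

Each printed URI can be passed to `docker pull` to repeat the check in another region.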
Merge Checklist
Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.
General
- [x] I have read the CONTRIBUTING doc
- [x] I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the Python SDK team
- [x] I used the commit message format described in CONTRIBUTING
- [x] I have passed the region in to all S3 and STS clients that I've initialized as part of this change.
- [x] I have updated any necessary documentation, including READMEs and API docs (if appropriate)
Tests
- [x] I have added tests that prove my fix is effective or that my feature works (if appropriate)
- [x] I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes
- [x] I have checked that my tests are not configured for a specific region or account (if appropriate)
- [x] I have used `unique_name_from_base` to create resource names in integ tests (if appropriate)
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-slow-tests
- Commit ID: 35426df8373e60101974241d5fa5bda1833dec69
- Result: SUCCEEDED
- Build Logs (available for 30 days)
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-notebook-tests
- Commit ID: 35426df8373e60101974241d5fa5bda1833dec69
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-local-mode-tests
- Commit ID: 35426df8373e60101974241d5fa5bda1833dec69
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-unit-tests
- Commit ID: 35426df8373e60101974241d5fa5bda1833dec69
- Result: FAILED
- Build Logs (available for 30 days)
Codecov Report
Merging #3251 (e0274dc) into master (284ddbe) will decrease coverage by 0.79%. The diff coverage is n/a.
@@ Coverage Diff @@
## master #3251 +/- ##
==========================================
- Coverage 89.82% 89.03% -0.80%
==========================================
Files 645 203 -442
Lines 55518 18275 -37243
==========================================
- Hits 49871 16271 -33600
+ Misses 5647 2004 -3643
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-pr
- Commit ID: 35426df8373e60101974241d5fa5bda1833dec69
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-local-mode-tests
- Commit ID: 9cafeb0ace3ce99959d3c4412ae0d63968df4612
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-slow-tests
- Commit ID: 9cafeb0ace3ce99959d3c4412ae0d63968df4612
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-notebook-tests
- Commit ID: 9cafeb0ace3ce99959d3c4412ae0d63968df4612
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-unit-tests
- Commit ID: 9cafeb0ace3ce99959d3c4412ae0d63968df4612
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-pr
- Commit ID: 9cafeb0ace3ce99959d3c4412ae0d63968df4612
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-unit-tests
- Commit ID: 8fdeda4b43c725aea4e683e87b3d0e89e6df68d4
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-pr
- Commit ID: 8fdeda4b43c725aea4e683e87b3d0e89e6df68d4
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-local-mode-tests
- Commit ID: 8fdeda4b43c725aea4e683e87b3d0e89e6df68d4
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-slow-tests
- Commit ID: 8fdeda4b43c725aea4e683e87b3d0e89e6df68d4
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-notebook-tests
- Commit ID: 8fdeda4b43c725aea4e683e87b3d0e89e6df68d4
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-notebook-tests
- Commit ID: 8fdeda4b43c725aea4e683e87b3d0e89e6df68d4
- Result: SUCCEEDED
- Build Logs (available for 30 days)
@kevinyang8 @navinsoni I think this change is still required, but some of the tests are still failing. I think this is because the training image version is 2.9.1, but the inference version I'm adding is 2.9.0. However, no training image 2.9.0 exists, and no inference image 2.9.1 exists, and the SDK (or the tests) assume that matching images will exist.
https://github.com/aws/sagemaker-python-sdk/blob/7d30d8c6f2149e9f02089367389afd1c58825092/src/sagemaker/image_uri_config/tensorflow.json#L1501
Training image 2.9.1 does exist:
$ docker pull "$(python -c 'import sagemaker; print(sagemaker.image_uris.retrieve(framework="tensorflow", region="us-west-2", version="2.9.1", image_scope="training", instance_type="ml.t3.medium"))')"
2.9.1-cpu-py39: Pulling from tensorflow-training
d7bfe07ed847: Already exists
223cc3730ba4: Downloading [> ] 538.2kB/233.6MB
...
But training image 2.9.0 does not. It's not set in the SDK's image uris:
$ python -c 'import sagemaker; print(sagemaker.image_uris.retrieve(framework="tensorflow", region="us-west-2", version="2.9.0", image_scope="training", instance_type="ml.t3.medium"))' [1]
Traceback (most recent call last):
...
ValueError: Unsupported tensorflow version: 2.9.0. You may need to upgrade your SDK version (pip install -U sagemaker) for newer tensorflow versions. Supported tensorflow version(s): 1.10.0, 1.11.0, 1.12.0, 1.13.1, 1.14.0, 1.15.0, 1.15.2, 1.15.3, 1.15.4, 1.15.5, 1.4.1, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.9.0, 2.0.0, 2.0.1, 2.0.2, 2.0.3, 2.0.4, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2, 2.4.1, 2.4.3, 2.5.0, 2.5.1, 2.6.0, 2.6.2, 2.6.3, 2.7.1, 2.8.0, 2.9.1, 1.10, 1.11, 1.12, 1.13, 1.14, 1.15, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9.
And it doesn't exist in the source ECR repos:
$ docker pull "$(python -c 'import sagemaker; print(sagemaker.image_uris.retrieve(framework="tensorflow", region="us-west-2", version="2.9.1", image_scope="training", instance_type="ml.t3.medium"))' | sed 's/2\.9\.1/2.9.0/')"
Error response from daemon: manifest for 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:2.9.0-cpu-py39 not found: manifest unknown: Requested image not found
And the CI failure I'm seeing looks like:
=================================== FAILURES ===================================
__________ test_deploy_with_input_handlers[ml.p3.2xlarge-2.9.0-2.9.1] __________
[gw264] linux -- Python 3.9.13 /codebuild/output/src599037907/src/github.com/aws/sagemaker-python-sdk/.tox/py39/bin/python
sagemaker_session = <sagemaker.session.Session object at 0x7f9ffbf98c10>
instance_type = 'ml.p3.2xlarge', tf_full_version = '2.9.0'
tf_full_py_version = 'py39'
def test_deploy_with_input_handlers(
sagemaker_session, instance_type, tf_full_version, tf_full_py_version
):
estimator = TensorFlow(
entry_point="training.py",
source_dir=TFS_RESOURCE_PATH,
role=ROLE,
instance_count=1,
instance_type=instance_type,
framework_version=tf_full_version,
py_version=tf_full_py_version,
sagemaker_session=sagemaker_session,
tags=TAGS,
)
> estimator.fit(job_name=unique_name_from_base("test-tf-tfs-deploy"))
tests/integ/test_tf.py:308:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
.tox/py39/lib/python3.9/site-packages/sagemaker/workflow/pipeline_context.py:209: in wrapper
return run_func(*args, **kwargs)
.tox/py39/lib/python3.9/site-packages/sagemaker/estimator.py:1006: in fit
self.latest_training_job = _TrainingJob.start_new(self, inputs, experiment_config)
.tox/py39/lib/python3.9/site-packages/sagemaker/estimator.py:1893: in start_new
train_args = cls._get_train_args(estimator, inputs, experiment_config)
.tox/py39/lib/python3.9/site-packages/sagemaker/estimator.py:1980: in _get_train_args
train_args["image_uri"] = estimator.training_image_uri()
.tox/py39/lib/python3.9/site-packages/sagemaker/estimator.py:2962: in training_image_uri
return image_uris.get_training_image_uri(
.tox/py39/lib/python3.9/site-packages/sagemaker/image_uris.py:504: in get_training_image_uri
return retrieve(
.tox/py39/lib/python3.9/site-packages/sagemaker/workflow/utilities.py:197: in wrapper
return func(*args, **kwargs)
.tox/py39/lib/python3.9/site-packages/sagemaker/image_uris.py:154: in retrieve
version = _validate_version_and_set_if_needed(version, config, framework)
.tox/py39/lib/python3.9/site-packages/sagemaker/image_uris.py:326: in _validate_version_and_set_if_needed
_validate_arg(version, available_versions + aliased_versions, "{} version".format(framework))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
arg = '2.9.0'
available_options = ['1.10.0', '1.11.0', '1.12.0', '1.13.1', '1.14.0', '1.15.0', ...]
arg_name = 'tensorflow version'
def _validate_arg(arg, available_options, arg_name):
"""Checks if the arg is in the available options, and raises a ``ValueError`` if not."""
if arg not in available_options:
> raise ValueError(
"Unsupported {arg_name}: {arg}. You may need to upgrade your SDK version "
"(pip install -U sagemaker) for newer {arg_name}s. Supported {arg_name}(s): "
"{options}.".format(arg_name=arg_name, arg=arg, options=", ".join(available_options))
)
E ValueError: Unsupported tensorflow version: 2.9.0. You may need to upgrade your SDK version (pip install -U sagemaker) for newer tensorflow versions. Supported tensorflow version(s): 1.10.0, 1.11.0, 1.12.0, 1.13.1, 1.14.0, 1.15.0, 1.15.2, 1.15.3, 1.15.4, 1.15.5, 1.4.1, 1.5.0, 1.6.0, 1.7.0, 1.8.0, 1.9.0, 2.0.0, 2.0.1, 2.0.2, 2.0.3, 2.0.4, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.3.2, 2.4.1, 2.4.3, 2.5.0, 2.5.1, 2.6.0, 2.6.2, 2.6.3, 2.7.1, 2.8.0, 2.9.1, 1.10, 1.11, 1.12, 1.13, 1.14, 1.15, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9.
So the test is trying to train with TF 2.9.0, and the SDK doesn't know about an image for that, and indeed none exists in the ECR repos. I think the tests are trying to run with 2.9.0 because this PR adds a 2.9.0 inference image.
To summarise (in us-west-2, repo 763104351884.dkr.ecr.us-west-2.amazonaws.com, and found the same results in eu-west-1):
| TF version | scope | image | known by SDK? | exists in ECR? |
|---|---|---|---|---|
| 2.9 | inference | tensorflow-inference:2.9-cpu | :heavy_check_mark:* | :heavy_check_mark: |
| 2.9.0 | inference | tensorflow-inference:2.9.0-cpu | :heavy_check_mark:* | :heavy_check_mark: |
| 2.9.1 | inference | tensorflow-inference:2.9.1-cpu | :x: | :x: |
| 2.9 | training | tensorflow-training:2.9-cpu-py39 | :heavy_check_mark: | :heavy_check_mark: |
| 2.9.0 | training | tensorflow-training:2.9.0-cpu-py39 | :x: | :x: |
| 2.9.1 | training | tensorflow-training:2.9.1-cpu-py39 | :heavy_check_mark: | :heavy_check_mark: |
(* = "added to the SDK by this PR")
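The same summary can be restated as data. A small sketch, with the "exists in ECR?" values copied from the table (CPU tags only) and a hypothetical helper `versions_in_both_scopes`, that lists which versions exist for both training and inference:

```python
# (version, scope) -> exists in ECR, copied from the summary table
EXISTS_IN_ECR = {
    ("2.9", "inference"): True,
    ("2.9.0", "inference"): True,
    ("2.9.1", "inference"): False,
    ("2.9", "training"): True,
    ("2.9.0", "training"): False,
    ("2.9.1", "training"): True,
}


def versions_in_both_scopes(exists):
    """Return the versions that exist for both training and inference."""
    versions = {version for version, _scope in exists}
    return sorted(
        v for v in versions
        if exists.get((v, "training")) and exists.get((v, "inference"))
    )


print(versions_in_both_scopes(EXISTS_IN_ECR))  # ['2.9']
```

Only the `2.9` alias is usable for a test that both trains and deploys; neither full version is.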
Some questions/thoughts:
- What assumptions do the tests (or maybe the SDK more generally) make about version numbers between training and inference images?
  - Is it that if version X has a training image then it is assumed to also have an inference image? This would make some sense, but I don't think this can be the case, as CI passed on https://github.com/aws/sagemaker-python-sdk/pull/3156
  - Is it that if version X has an inference image then it is assumed to also have a training image? This appears to be the case, and I think it is why CI is failing on this PR.
- Why does TensorFlow image 2.9.1 exist for training but not for inference? Is this a bug (i.e. because for any training image there should be an inference image with exactly the same version number, and vice versa) or is this expected (i.e. because the patch version number doesn't really matter and is actually more like a build number)?
- What should happen next? I don't think I can fix this just by making changes to the image URI config file. I think one of the following is needed:
  - If image tags between training and inference are expected to match exactly, new image(s) need to be pushed to the repositories: probably a `tensorflow-inference:2.9.1-cpu` pushed to the repo in each region, after which I update the PR. I expect this option is the right one.
  - If image tags between training and inference are not expected to match exactly, this SDK (or just the tests) needs to change to stop making this assumption, so that CI passes. I expect this option is not the right one.
Aha:
https://github.com/aws/sagemaker-python-sdk/blob/6f72e3cf40757f5b5669729d5b2fe3e5da5ae76c/tests/conftest.py#L380-L392
That assumes that the minimum of the two latest versions for training and inference will exist for both. So, in this case, it finds 2.9.1 for training and 2.9.0 for inference, and assumes that 2.9.0 will exist for both, but it does not.
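A minimal sketch of that behaviour, not the fixture's actual code: versions are parsed into integer tuples here to avoid extra dependencies, and `parse` and `common_test_version` are hypothetical names.

```python
def parse(version):
    """Turn '2.9.1' into (2, 9, 1) so that versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def common_test_version(training_latest, inference_latest):
    """Pick the lower of the two latest versions, as the fixture is described to do."""
    return min(training_latest, inference_latest, key=parse)


chosen = common_test_version("2.9.1", "2.9.0")
print(chosen)  # 2.9.0, for which no training image exists
```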
So the tests need a version that exists for both, which means doing one of the following:
- "backfill" the training image by pushing a 2.9.0
- release an inference 2.9.1 image
- rework the tests and the SDK to handle a mismatch between the TF versions
- rework that test fixture to add in "if the minimum of the two versions is 2.9.0, actually return 2.9"
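The last option could look roughly like the sketch below. It is an assumption about one possible shape, not a proposed patch: the version sets cover only the TF 2.9 entries from the summary table (including those this PR adds), and `version_for_both` is a hypothetical name.

```python
# Versions known to the SDK for TF 2.9, taken from the summary table
TRAINING = {"2.9", "2.9.1"}
INFERENCE = {"2.9", "2.9.0"}  # both added by this PR


def version_for_both(version, training=TRAINING, inference=INFERENCE):
    """Return `version` if both scopes know it, otherwise its major.minor alias."""
    if version in training and version in inference:
        return version
    alias = ".".join(version.split(".")[:2])
    if alias in training and alias in inference:
        return alias
    raise ValueError(f"no image for {version} or {alias} in both scopes")


print(version_for_both("2.9.0"))  # 2.9
```

Falling back to the alias keeps the check general instead of hard-coding 2.9.0, though any test that needs a full `major.minor.patch` version would still need separate handling.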
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-pr
- Commit ID: f47e3f3e2c3a74f53f7283b501908eaae901b284
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-unit-tests
- Commit ID: f47e3f3e2c3a74f53f7283b501908eaae901b284
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-unit-tests
- Commit ID: e0274dc2c45aee037526e7064f000409c13ab231
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-local-mode-tests
- Commit ID: e0274dc2c45aee037526e7064f000409c13ab231
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-slow-tests
- Commit ID: e0274dc2c45aee037526e7064f000409c13ab231
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-notebook-tests
- Commit ID: e0274dc2c45aee037526e7064f000409c13ab231
- Result: FAILED
- Build Logs (available for 30 days)