sagemaker-python-sdk
fix: make sure gpus are found in local_gpu run
Description of changes:
When running SageMaker in local_gpu mode, the GPUs are not found. The following config:
deploy:
  resources:
    reservations:
      devices:
        - capabilities: [gpu]
This results in no GPUs being found (Docker Compose version v2.24.1).
Specifying count: all in the docker-compose.yml solves this issue and shouldn't change the behaviour (according to the Docker Compose documentation, this should already be the default: https://docs.docker.com/compose/gpu-support/).
deploy:
  resources:
    reservations:
      devices:
        - count: all
          capabilities: [gpu]
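As a rough illustration of the change, the Compose mapping above can be built as a Python dict. This is a minimal sketch: `gpu_deploy_section` and `host_config` are hypothetical names used only for this example, not the SDK's actual code.

```python
def gpu_deploy_section(count="all"):
    """Build a Compose `deploy` mapping that reserves GPU devices.

    Hypothetical helper for illustration only; the SDK builds this
    mapping inside its own local-mode code.
    """
    return {
        "resources": {
            "reservations": {
                # Setting `count` explicitly is the fix proposed in this PR.
                "devices": [{"count": count, "capabilities": ["gpu"]}]
            }
        }
    }


# Attach the GPU reservation to a (hypothetical) per-host service config.
host_config = {"image": "my-image"}
host_config["deploy"] = gpu_deploy_section()
print(host_config["deploy"])
```

Serialising `host_config` to YAML would yield a `deploy` section matching the one shown above.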
Testing done:
Without this change, no GPUs are found; with it, they are.
Merge Checklist
Put an x in the boxes that apply. You can also fill these out after creating the PR. If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your pull request.
General
- [x] I have read the CONTRIBUTING doc
- [x] I certify that the changes I am introducing will be backward compatible, and I have discussed concerns about this, if any, with the Python SDK team
- [x] I used the commit message format described in CONTRIBUTING
- [ ] I have passed the region in to all S3 and STS clients that I've initialized as part of this change.
- [ ] I have updated any necessary documentation, including READMEs and API docs (if appropriate)
Tests
- [ ] I have added tests that prove my fix is effective or that my feature works (if appropriate)
- [ ] I have added unit and/or integration tests as appropriate to ensure backward compatibility of the changes
- [ ] I have checked that my tests are not configured for a specific region or account (if appropriate)
- [ ] I have used unique_name_from_base to create resource names in integ tests (if appropriate)
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Hi, thanks! This is a long-standing issue. I suggested a fix for it in another PR as well, but it was never merged.
Hope this gets merged and a new version is released soon. Local mode with GPU has been broken for a long time.
Tagging more people @makungaj1 @jmahlik
@ozancaglayan I'm not a maintainer in this repo and don't have GPUs to test it on at the moment. The change seems reasonable to me.
I have had good luck getting PRs reviewed and merged by using the request review button (@gverkes would have to do it, if available) and making sure the PR is attached to an open issue. I'm not exactly sure how the review requests are auto-assigned, but they seem to round-robin to the maintainers.
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-unit-tests
- Commit ID: ba60af5fb8baca1eadc594fa00d4ca91a92ce922
- Result: FAILED
- Build Logs (available for 30 days)
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-local-mode-tests
- Commit ID: ba60af5fb8baca1eadc594fa00d4ca91a92ce922
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-unit-tests
- Commit ID: 3331794b3512af6f2dea89d715d856180228f9fd
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-local-mode-tests
- Commit ID: 3331794b3512af6f2dea89d715d856180228f9fd
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-notebook-tests
- Commit ID: 3331794b3512af6f2dea89d715d856180228f9fd
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-slow-tests
- Commit ID: 3331794b3512af6f2dea89d715d856180228f9fd
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-pr
- Commit ID: 3331794b3512af6f2dea89d715d856180228f9fd
- Result: SUCCEEDED
- Build Logs (available for 30 days)
@gverkes can you run tox -e black-format to fix the failing unit tests and update the PR?
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-local-mode-tests
- Commit ID: 3331794b3512af6f2dea89d715d856180228f9fd
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-local-mode-tests
- Commit ID: 6374b90a794598fc5c871e3708859b2b5e6a73a5
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-notebook-tests
- Commit ID: 6374b90a794598fc5c871e3708859b2b5e6a73a5
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-pr
- Commit ID: 6374b90a794598fc5c871e3708859b2b5e6a73a5
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-slow-tests
- Commit ID: 6374b90a794598fc5c871e3708859b2b5e6a73a5
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-unit-tests
- Commit ID: 6374b90a794598fc5c871e3708859b2b5e6a73a5
- Result: FAILED
- Build Logs (available for 30 days)
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 87.25%. Comparing base (8b206ba) to head (1cefc73). Report is 56 commits behind head on master.
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #4384       +/-   ##
===========================================
+ Coverage   86.94%   87.25%    +0.31%
===========================================
  Files        1203      388      -815
  Lines      107211    36255    -70956
===========================================
- Hits        93211    31635    -61576
+ Misses      14000     4620     -9380
Hi @gverkes - seems like a unit test is failing:
=================================== FAILURES ===================================
________________________ test_container_has_gpu_support ________________________
tmpdir = local('/tmp/pytest-of-root/pytest-2/test_container_has_gpu_support0')
sagemaker_session = <sagemaker.session.Session object at 0x7f351ee197f0>
    def test_container_has_gpu_support(tmpdir, sagemaker_session):
        instance_count = 1
        image = "my-image"
        sagemaker_container = _SageMakerContainer(
            "local_gpu", instance_count, image, sagemaker_session=sagemaker_session
        )
        docker_host = sagemaker_container._create_docker_host("host-1", {}, set(), "train", [])
        assert "deploy" in docker_host
>       assert docker_host["deploy"] == {
            "resources": {"reservations": {"devices": [{"capabilities": ["gpu"]}]}}
        }
E       AssertionError: assert {'resources':...t': 'all'}]}}} == {'resources':...: ['gpu']}]}}}
E         Differing items:
E         {'resources': {'reservations': {'devices': [{'capabilities': ['gpu'], 'count': 'all'}]}}} != {'resources': {'reservations': {'devices': [{'capabilities': ['gpu']}]}}}
E         Use -v to get the full diff
tests/unit/sagemaker/local/test_local_image.py:873: AssertionError
=============================== warnings summary ===============================
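The assertion diff suggests the test's expected value still describes the old mapping, without the count key. A minimal sketch of an updated expectation follows; here `docker_host` is a stand-in copied from the left-hand side of the diff in the log rather than a real call to `_create_docker_host`, and `expected_deploy` is a name introduced only for this example.

```python
# Stand-in for the value returned by `_create_docker_host` after the change,
# taken from the left-hand side of the assertion diff in the log above.
docker_host = {
    "deploy": {
        "resources": {
            "reservations": {
                "devices": [{"capabilities": ["gpu"], "count": "all"}]
            }
        }
    }
}

# Updated expectation: same structure as before, plus `count: all`.
expected_deploy = {
    "resources": {
        "reservations": {
            "devices": [{"count": "all", "capabilities": ["gpu"]}]
        }
    }
}

assert "deploy" in docker_host
assert docker_host["deploy"] == expected_deploy
```

Dict comparison in Python ignores key order, so the position of count within the device entry does not affect the assertion.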
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-local-mode-tests
- Commit ID: c33e9dae98ad4f5fa6670b9228228f3bde67536f
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-notebook-tests
- Commit ID: c33e9dae98ad4f5fa6670b9228228f3bde67536f
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-pr
- Commit ID: c33e9dae98ad4f5fa6670b9228228f3bde67536f
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-slow-tests
- Commit ID: c33e9dae98ad4f5fa6670b9228228f3bde67536f
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-unit-tests
- Commit ID: c33e9dae98ad4f5fa6670b9228228f3bde67536f
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-notebook-tests
- Commit ID: 292955af000e45a903e055ea42f969d6f7f3bcb8
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-slow-tests
- Commit ID: 292955af000e45a903e055ea42f969d6f7f3bcb8
- Result: SUCCEEDED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-local-mode-tests
- Commit ID: 292955af000e45a903e055ea42f969d6f7f3bcb8
- Result: FAILED
- Build Logs (available for 30 days)
AWS CodeBuild CI Report
- CodeBuild project: sagemaker-python-sdk-pr
- Commit ID: 292955af000e45a903e055ea42f969d6f7f3bcb8
- Result: FAILED
- Build Logs (available for 30 days)