[huggingface_pytorch] Update buildspec to re-release pt-1.9.1

Open philschmid opened this issue 2 years ago • 2 comments

GitHub Issue #, if available:

Update buildspec.yml to the content of buildspec-1-9-1.yml to re-release 1.9.1 images to fix sagemaker-inference log issue.

  • https://github.com/aws/sagemaker-huggingface-inference-toolkit/issues/61

By default, the PR builds off of buildspec.yml. To build and test the 1.9.1 images in a PR, you need to make a temporary commit that copies the contents of buildspec-1-9-1.yml into buildspec.yml.
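The temporary commit described above could be prepared with a small helper like the following. This is only a sketch: the function name `stage_buildspec` and the `framework_dir` argument are hypothetical, and the directory that holds the two buildspec files is an assumption you would need to fill in for your checkout.

```python
import shutil
from pathlib import Path


def stage_buildspec(framework_dir: Path, version_spec: str = "buildspec-1-9-1.yml") -> Path:
    """Overwrite buildspec.yml with the contents of a version-specific buildspec.

    framework_dir is assumed (hypothetically) to contain both files.
    """
    source = framework_dir / version_spec
    target = framework_dir / "buildspec.yml"
    if not source.is_file():
        raise FileNotFoundError(f"missing version-specific buildspec: {source}")
    # Copy file contents only; the result still has to be committed by hand.
    shutil.copyfile(source, target)
    return target
```

Because the commit is temporary, the original buildspec.yml should be restored before the PR is merged.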

Description

Tests run

NOTE: If you are creating a PR for a new framework version, please ensure success of the standard, rc, and efa sagemaker remote tests by updating the dlc_developer_config.toml file:

  • [ ] Revision A: sagemaker_remote_tests = "standard"
  • [ ] Revision B: sagemaker_remote_tests = "rc"
  • [ ] Revision C: sagemaker_remote_tests = "efa"

Additionally, please run the sagemaker local tests in at least one revision:

  • [ ] sagemaker_local_tests = true
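Switching between the revisions above means editing single keys in dlc_developer_config.toml. A minimal sketch of doing that with a text substitution is shown below; the helper name `set_config_value` is hypothetical, the layout of the real file is not assumed beyond `key = value` lines, and any inline comment on the edited line would be dropped.

```python
import re


def set_config_value(config_text: str, key: str, value: str) -> str:
    """Return config_text with the first `key = ...` line set to `value`.

    `value` is inserted verbatim, so pass '"rc"' for a string and 'true'
    for a boolean.
    """
    pattern = re.compile(rf"^([ \t]*{re.escape(key)}[ \t]*=[ \t]*).*$", flags=re.MULTILINE)
    # A function replacement avoids backslash handling in `value`.
    updated, count = pattern.subn(lambda m: m.group(1) + value, config_text, count=1)
    if count == 0:
        raise KeyError(f"{key} not found in config")
    return updated


# Made-up sample text, not the actual contents of dlc_developer_config.toml.
sample = 'sagemaker_local_tests = false\nsagemaker_remote_tests = "off"\n'
revision_b = set_config_value(sample, "sagemaker_remote_tests", '"rc"')
```

The same helper would apply to the boolean flags mentioned in the checklists, for example `set_config_value(text, "sagemaker_local_tests", "true")`.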

DLC image/dockerfile

Additional context

Label Checklist

  • [ ] I have added the project label for this PR (<project_name> or "Improvement")

PR Checklist

  • [ ] I've prepended PR tag with frameworks/job this applies to : [mxnet, tensorflow, pytorch] | [ei/neuron/graviton] | [build] | [test] | [benchmark] | [ec2, ecs, eks, sagemaker]
  • [ ] If the PR changes affect SM tests, I've modified dlc_developer_config.toml in my PR branch by setting sagemaker_tests = true and efa_tests = true
  • [ ] If this PR changes existing code, the change is fully backward compatible with pre-existing code. (Non-backward-compatible changes need special approval.)
  • [ ] (If applicable) I've documented below the DLC image/dockerfile this relates to
  • [ ] (If applicable) I've documented below the tests I've run on the DLC image
  • [ ] (If applicable) I've reviewed the licenses of updated and new binaries and their dependencies to make sure all licenses are on the Apache Software Foundation Third Party License Policy Category A or Category B license list. See https://www.apache.org/legal/resolved.html.
  • [ ] (If applicable) I've scanned the updated and new binaries to make sure they do not have vulnerabilities associated with them.

Pytest Marker Checklist

  • [ ] (If applicable) I have added the marker @pytest.mark.model("<model-type>") to the new tests which I have added, to specify the Deep Learning model that is used in the test (use "N/A" if the test doesn't use a model)
  • [ ] (If applicable) I have added the marker @pytest.mark.integration("<feature-being-tested>") to the new tests which I have added, to specify the feature that will be tested
  • [ ] (If applicable) I have added the marker @pytest.mark.multinode(<integer-num-nodes>) to the new tests which I have added, to specify the number of nodes used on a multi-node test
  • [ ] (If applicable) I have added the marker @pytest.mark.processor(<"cpu"/"gpu"/"eia"/"neuron">) to the new tests which I have added, if a test is specifically applicable to only one processor type

EIA/NEURON/GRAVITON Testing Checklist

  • When creating a PR:
  • [ ] I've modified dlc_developer_config.toml in my PR branch by setting ei_mode = true, neuron_mode = true or graviton_mode = true

Benchmark Testing Checklist

  • When creating a PR:
  • [ ] I've modified dlc_developer_config.toml in my PR branch by setting benchmark_mode = true

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

philschmid, May 11 '22 13:05

@philschmid there are 10 tests currently failing. Can you help debug the tests that are not "test_repo_anaconda_present"?

Three of the failing tests are "test_repo_anaconda_not_present", which @kevinyang8 can help debug. Since the training image is built on top of a released DLC, we won't be able to fix the "training" image in this PR until the base DLC is re-released, so we should prioritize updating the PT 1.9 Dockerfiles as well.

arjkesh, May 12 '22 17:05

@philschmid To fix your "test_repo_anaconda_present" tests, you can merge this PR into your PR; it makes the changes needed to pass the test. We are currently prioritizing updating the PT 1.9 images with similar changes and releasing them. You can track the progress of that via this PR.

kevinyang8, May 12 '22 22:05

@philschmid Can you close the PR if it is no longer valid, or move it to draft if it is not currently being worked on?

tejaschumbalkar, Mar 22 '23 06:03