Xin Wang

Results: 6 issues by Xin Wang

*GitHub Issue #, if available:* **Note**: - If merging this PR should also close the associated Issue, please also add that Issue # to the Linked Issues section on the...

build
test
pytorch
ec2
Size:S
no-pr-activity

*Issue #, if available:* build test *Description of changes:* build test *Testing done:* build test ## Merge Checklist _Put an `x` in the boxes that apply. You can also fill...

*Issue #, if available:* *Description of changes:* 1. Add a training job TFLOPS calculator. 2. Add a stuck job monitor: customers can monitor the output of a specified CloudWatch log stream. If customer did...

### What you would like to be added? ## Description We are proposing changes to enhance training job restart, which can help avoid restart failures and delays in case of...

kind/feature
lifecycle/needs-triage

Hi, the release notes mention that aibrix supports 'GPU Hardware Failure Detection: Proactive detection of GPU hardware issues'. There is a separate package https://github.com/aibrix/ai-accelerator-tool/tree/main/pkg/diagnose, but that seems a...

priority/critical-urgent
area/tools
area/community

*Issue #, if available:* *Description of changes:* Add HyperpodTrainingOperatorServiceRole, which is required by https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-eks-operator.html. By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this...