[ROCm] fix: obtain AMD GPU memory info through rocm_smi library
Description
Previously, ROCMExecutionProvider used hipMemGetInfo to obtain the total and available memory sizes. However, this API has been broken since ROCm 5.7. In this PR, we use the rocm_smi library instead of hipMemGetInfo.
Motivation and Context
The hipMemGetInfo API has been broken since ROCm 5.7, and inference with ROCMExecutionProvider fails with errors like the following:
HIP failure 1: invalid argument ; GPU=0 ; hostname=4cc4900475fe ; file=/onnxruntime/onnxruntime/core/providers/rocm/rocm_execution_provider.cc ; line=229 ; expr=hipMemGetInfo(&free, &total);
MIOpen has a brute-force fix for this (https://github.com/ROCm/MIOpen/blob/911e67189592c311374940493f2099f3abced60d/src/hip/handlehip.cpp#L72): it hard-codes the available memory to 16 GB. Instead of doing that, I suppose we could obtain the memory info through the rocm_smi library, as in this PR.
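For context, here is a minimal standalone sketch of the rocm_smi approach. This is not the PR's actual code: the device index handling and error checking are simplified, and it assumes the rocm_smi headers are available and the program is linked against librocm_smi64.

#include <cstdint>
#include <cstdio>
#include <rocm_smi/rocm_smi.h>

int main() {
  uint32_t device_id = 0;  // query the first GPU; adjust as needed
  uint64_t total = 0, used = 0;

  if (rsmi_init(0) != RSMI_STATUS_SUCCESS) return 1;

  // RSMI_MEM_TYPE_VIS_VRAM is the CPU-visible VRAM, the closest match to
  // what hipMemGetInfo reported for total/free device memory.
  rsmi_dev_memory_total_get(device_id, RSMI_MEM_TYPE_VIS_VRAM, &total);
  rsmi_dev_memory_usage_get(device_id, RSMI_MEM_TYPE_VIS_VRAM, &used);

  std::printf("total=%llu bytes, free=%llu bytes\n",
              (unsigned long long)total,
              (unsigned long long)(total - used));

  rsmi_shut_down();
  return 0;
}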
@hann-wang please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:
- (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
  @microsoft-github-policy-service agree
- (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
  @microsoft-github-policy-service agree company="Microsoft"
@microsoft-github-policy-service agree [company="AMD"]
@microsoft-github-policy-service agree company="your company"
@microsoft-github-policy-service agree company="AMD Inc."
Good idea.
a9c86724ae64eb034c5ea017a8bf9d059182c245
Tianlei Wu replied:
How about logic like the following:
const auto status = hipMemGetInfo(free, total);
if (status != hipSuccess) {
  // hipMemGetInfo is broken on ROCm >= 5.7; fall back to rocm_smi.
  uint64_t used = 0;
  ROCMSMI_CALL_THROW(rsmi_init(0));
  ROCMSMI_CALL_THROW(rsmi_dev_memory_total_get(deviceId, RSMI_MEM_TYPE_VIS_VRAM, total));
  ROCMSMI_CALL_THROW(rsmi_dev_memory_usage_get(deviceId, RSMI_MEM_TYPE_VIS_VRAM, &used));
  *free = *total - used;
  ROCMSMI_CALL_THROW(rsmi_shut_down());
}
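One design note on this suggested fallback: rsmi_init and rsmi_shut_down run on every failed hipMemGetInfo call, and on ROCm >= 5.7 that call fails every time. If the memory query sits on a hot path, initializing rocm_smi once per process may be worth considering instead; how often the provider actually queries memory is not settled in this thread.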
/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline
/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline
/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline
Pipelines were unable to run due to time out waiting for the pull request to finish merging.
@hann-wang, the Python format pipeline failed. Please fix it by running lintrunner at the repository root, like:
pip install -r requirements-lintrunner.txt
pip install lintrunner
lintrunner init
lintrunner -a
/azp run orttraining-amd-gpu-ci-pipeline
Azure Pipelines successfully started running 1 pipeline(s).
got it, thank you!
9058961dbbc452418970f88e7633a0d2fe8910b8
/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline
/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline
/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline
Azure Pipelines successfully started running 3 pipeline(s).
Azure Pipelines successfully started running 10 pipeline(s).
Azure Pipelines successfully started running 10 pipeline(s).