
[ROCm] fix: obtain AMD GPU memory info through rocm_smi library

hann-wang opened this pull request 1 year ago • 19 comments

Description

Previously, ROCMExecutionProvider used hipMemGetInfo to obtain the total and available GPU memory sizes. However, that API has been broken since ROCm 5.7. This PR uses the rocm_smi library instead of hipMemGetInfo.

Motivation and Context

The hipMemGetInfo API has been broken since ROCm 5.7, and inference with ROCMExecutionProvider leads to the following error:

HIP failure 1: invalid argument ; GPU=0 ; hostname=4cc4900475fe ; file=/onnxruntime/onnxruntime/core/providers/rocm/rocm_execution_provider.cc ; line=229 ; expr=hipMemGetInfo(&free, &total);

MIOpen has a brute-force fix for this (https://github.com/ROCm/MIOpen/blob/911e67189592c311374940493f2099f3abced60d/src/hip/handlehip.cpp#L72). Instead of hard-coding the available memory to 16 GB, I suppose we could obtain the memory info through the rocm_smi library, as done in this PR.
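The fallback described above can be sketched as follows. The stub functions below stand in for the real HIP and ROCm-SMI calls (hipMemGetInfo, rsmi_dev_memory_total_get, rsmi_dev_memory_usage_get) so the control flow compiles without a GPU; the names, the Status enum, and the simulated sizes are all hypothetical, not the PR's actual implementation.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for hipError_t / rsmi_status_t return codes.
enum Status { kSuccess = 0, kInvalidArgument = 1 };

// Stub for hipMemGetInfo: simulate the ROCm >= 5.7 breakage by failing,
// as in the "HIP failure 1: invalid argument" error from the thread.
static Status StubHipMemGetInfo(std::size_t* /*free*/, std::size_t* /*total*/) {
  return kInvalidArgument;
}

// Stubs for the rocm_smi queries used in the PR; the real functions take a
// device index, an rsmi_memory_type_t, and return sizes in bytes.
static Status StubRsmiMemoryTotalGet(uint32_t /*device*/, uint64_t* total) {
  *total = 16ull << 30;  // pretend a 16 GiB card
  return kSuccess;
}
static Status StubRsmiMemoryUsageGet(uint32_t /*device*/, uint64_t* used) {
  *used = 4ull << 30;  // pretend 4 GiB currently in use
  return kSuccess;
}

// Fallback logic mirroring the suggestion in the thread: try HIP first,
// and only consult the SMI library when hipMemGetInfo fails.
bool GetDeviceMemInfo(uint32_t device, std::size_t* free, std::size_t* total) {
  if (StubHipMemGetInfo(free, total) == kSuccess) return true;
  uint64_t smi_total = 0, smi_used = 0;
  if (StubRsmiMemoryTotalGet(device, &smi_total) != kSuccess) return false;
  if (StubRsmiMemoryUsageGet(device, &smi_used) != kSuccess) return false;
  *total = static_cast<std::size_t>(smi_total);
  *free = static_cast<std::size_t>(smi_total - smi_used);
  return true;
}
```

In the real provider the SMI path would also need rsmi_init/rsmi_shut_down bracketing, which the stubs here omit.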

hann-wang avatar Jun 27 '24 08:06 hann-wang

@hann-wang please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree [company="AMD"]

hann-wang avatar Jun 27 '24 08:06 hann-wang


@microsoft-github-policy-service agree company="AMD Inc."

hann-wang avatar Jun 27 '24 08:06 hann-wang

Good idea.

a9c86724ae64eb034c5ea017a8bf9d059182c245

From: Tianlei Wu — Re: [microsoft/onnxruntime] [ROCm] fix: obtain AMD GPU memory info through rocm_smi library (PR #21190)

How about logic like the following:

const auto status = hipMemGetInfo(free, total);
if (status != hipSuccess) {
  ROCMSMI_CALL_THROW(rsmi_init(0));
  ROCMSMI_CALL_THROW(rsmi_dev_memory_total_get(deviceId, RSMI_MEM_TYPE_VIS_VRAM, total));
  ROCMSMI_CALL_THROW(rsmi_dev_memory_usage_get(deviceId, RSMI_MEM_TYPE_VIS_VRAM, &used));
  *free = *total - used;
  ROCMSMI_CALL_THROW(rsmi_shut_down());
}


hann-wang avatar Jun 28 '24 04:06 hann-wang

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

tianleiwu avatar Jun 28 '24 17:06 tianleiwu

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

tianleiwu avatar Jun 28 '24 17:06 tianleiwu

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

tianleiwu avatar Jun 28 '24 17:06 tianleiwu

Pipelines were unable to run due to time out waiting for the pull request to finish merging.

azure-pipelines[bot] avatar Jun 28 '24 17:06 azure-pipelines[bot]

Pipelines were unable to run due to time out waiting for the pull request to finish merging.

azure-pipelines[bot] avatar Jun 28 '24 17:06 azure-pipelines[bot]

Pipelines were unable to run due to time out waiting for the pull request to finish merging.

azure-pipelines[bot] avatar Jun 28 '24 17:06 azure-pipelines[bot]

@hann-wang, the python format pipeline failed. Please fix it by running lintrunner at the repo root, like:

pip install -r requirements-lintrunner.txt
pip install lintrunner
lintrunner init
lintrunner -a

tianleiwu avatar Jun 28 '24 17:06 tianleiwu

/azp run orttraining-amd-gpu-ci-pipeline

tianleiwu avatar Jun 28 '24 17:06 tianleiwu

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Jun 28 '24 17:06 azure-pipelines[bot]

> @hann-wang, the python format pipeline failed. Please fix it by running lintrunner at the root.

Got it, thank you!

9058961dbbc452418970f88e7633a0d2fe8910b8

hann-wang avatar Jul 01 '24 01:07 hann-wang

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

tianleiwu avatar Jul 01 '24 04:07 tianleiwu

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

tianleiwu avatar Jul 01 '24 04:07 tianleiwu

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline

tianleiwu avatar Jul 01 '24 04:07 tianleiwu

Azure Pipelines successfully started running 3 pipeline(s).

azure-pipelines[bot] avatar Jul 01 '24 04:07 azure-pipelines[bot]

Azure Pipelines successfully started running 10 pipeline(s).

azure-pipelines[bot] avatar Jul 01 '24 04:07 azure-pipelines[bot]

Azure Pipelines successfully started running 10 pipeline(s).

azure-pipelines[bot] avatar Jul 01 '24 04:07 azure-pipelines[bot]