rocm_smi_lib
RSMI_STATUS_PERMISSION on rocm-smi --setmclk
- System: ubuntu-focal (5.4.0-109-generic)
- rocm-5.2.1
- GPU: MI250X
I am trying to set the memory clock frequency using rocm-smi, and it fails with the RSMI_STATUS_PERMISSION error. The performance level was set to manual:
$ rocm-smi --showhw
======================= ROCm System Management Interface =======================
============================ Concise Hardware Info =============================
GPU  DID   GFX RAS  SDMA RAS  UMC RAS  VBIOS           BUS
0    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:31:00.0
1    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:34:00.0
2    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:11:00.0
3    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:14:00.0
4    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:AE:00.0
5    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:B3:00.0
6    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:8E:00.0
7    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:93:00.0
================================================================================
============================= End of ROCm SMI Log ==============================
$ sudo rocm-smi -d 0 --showclkfrq --showperflevel
======================= ROCm System Management Interface =======================
============================ Show Performance Level ============================
GPU[0] : Performance Level: manual
================================================================================
========================= Supported clock frequencies ==========================
GPU[0] :
GPU[0] : Supported fclk frequencies on GPU0
GPU[0] : 0: 0Mhz *
GPU[0] :
GPU[0] : Supported mclk frequencies on GPU0
GPU[0] : 0: 400Mhz
GPU[0] : 1: 700Mhz
GPU[0] : 2: 1200Mhz
GPU[0] : 3: 1600Mhz *
GPU[0] :
GPU[0] : Supported sclk frequencies on GPU0
GPU[0] : 0: 500Mhz
GPU[0] : 1: 1700Mhz *
GPU[0] :
GPU[0] : Supported socclk frequencies on GPU0
GPU[0] : 0: 666Mhz
GPU[0] : 1: 857Mhz
GPU[0] : 2: 1000Mhz
GPU[0] : 3: 1090Mhz *
GPU[0] : 4: 1333Mhz
GPU[0] :
--------------------------------------------------------------------------------
================================================================================
============================= End of ROCm SMI Log ==============================
$ sudo rocm-smi -d 0 --setmclk 2
======================= ROCm System Management Interface =======================
============================== Set mclk Frequency ==============================
ERROR: 4 GPU[0]:RSMI_STATUS_PERMISSION: The user ID of the calling process does not have sufficient permission to execute a command. Often this is fixed by running as root (sudo).
ERROR: GPU[0] : Unable to set mclk bitmask to: 0x4
================================================================================
============================= End of ROCm SMI Log ==============================
$ sudo rocm-smi -d 0 --setmclk 0
======================= ROCm System Management Interface =======================
============================== Set mclk Frequency ==============================
ERROR: 4 GPU[0]:RSMI_STATUS_PERMISSION: The user ID of the calling process does not have sufficient permission to execute a command. Often this is fixed by running as root (sudo).
ERROR: GPU[0] : Unable to set mclk bitmask to: 0x1
================================================================================
============================= End of ROCm SMI Log ==============================
I found that only sclk is configurable. Is this expected, or did I miss something? Thanks!
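For reference, the bitmask values in the two error messages above are consistent with each requested level index being encoded as a single set bit (level 2 gives 0x4, level 0 gives 0x1). A minimal Python sketch of that mapping, assuming this is how the index is encoded; the helper name `levels_to_bitmask` is hypothetical and not part of rocm-smi:

```python
def levels_to_bitmask(levels):
    """Return an integer with one bit set for each requested level index."""
    mask = 0
    for level in levels:
        if level < 0:
            raise ValueError(f"level index must be non-negative, got {level}")
        mask |= 1 << level  # level N corresponds to bit N
    return mask


# The two requests shown in the logs above.
print(hex(levels_to_bitmask([2])))  # 0x4
print(hex(levels_to_bitmask([0])))  # 0x1
```

So the tool appears to build the mask correctly in both cases, and the failure comes from the write being rejected, not from a malformed request.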
Did you set the feature mask and set the performance level to manual, like this?
rocm-smi --setperflevel manual
sudo rocm-smi --setvc 2 1701 915 --autorespond y
sudo rocm-smi --setsrange 808 1740 --autorespond y
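To double-check what the driver itself reports, the amdgpu driver exposes the forced performance level and the memory clock levels as sysfs files (`power_dpm_force_performance_level` and `pp_dpm_mclk` in the device directory). A minimal Python sketch that reads them, assuming the usual `/sys/class/drm/card<N>/device` layout and the same `N: <freq>Mhz *` line format that rocm-smi prints; the function names are made up for illustration:

```python
import re
from pathlib import Path

# Matches lines such as "3: 1600Mhz *"; the trailing "*" marks the active level.
_LEVEL_RE = re.compile(r"^\s*(\d+):\s*(\d+)\s*Mhz\s*(\*)?\s*$", re.IGNORECASE)


def read_perf_level(device_dir):
    """Return the forced performance level string, e.g. 'auto' or 'manual'."""
    path = Path(device_dir) / "power_dpm_force_performance_level"
    return path.read_text().strip()


def parse_dpm_levels(text):
    """Parse clock level lines into (index, mhz, is_current) tuples."""
    levels = []
    for line in text.splitlines():
        match = _LEVEL_RE.match(line)
        if match is None:
            continue  # skip blank or unrecognized lines
        index, mhz, star = match.groups()
        levels.append((int(index), int(mhz), star is not None))
    return levels


def read_mclk_levels(device_dir):
    """Read and parse pp_dpm_mclk from the given device directory."""
    return parse_dpm_levels((Path(device_dir) / "pp_dpm_mclk").read_text())
```

On a real system one would call, for example, `read_perf_level("/sys/class/drm/card0/device")`; the card index is an assumption and may differ from the rocm-smi device number.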
@rakataprime Thanks for your input. I've tried the feature mask, which I didn't set properly before. However, I still cannot change the memory clock frequency as I wish.
BTW, I found the following error when the amdgpu module is loaded (regardless of the kernel parameter ppfeature):
[ 14.070181] ------------[ cut here ]------------
[ 14.070182] RAS ERROR: unexpected block id 15
[ 14.070285] WARNING: CPU: 0 PID: 5 at /var/lib/dkms/amdgpu/5.16.9.22.20-1447096~20.04/build/amd/amdgpu/amdgpu_ras.h:579 amdgpu_ras_feature_enable+0x1b4/0x210 [amdgpu]
[ 14.070285] Modules linked in: crc32_pclmul hid_generic ib_uverbs ib_core amdgpu(OE+) amd_iommu_v2 amdttm(OE) amd_sched(OE) amdkcl(OE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci drm usbhid nvme libahci i2c_algo_bit hid i40e nvme_core i2c_piix4 wmi
[ 14.070299] CPU: 0 PID: 5 Comm: kworker/0:0 Tainted: G OE 5.4.0-109-generic #123-Ubuntu
[ 14.070300] Hardware name: Supermicro AS -4124GQ-TNMI/H12DGQ-NT6, BIOS 2.4 08/23/2022
[ 14.070309] Workqueue: events work_for_cpu_fn
[ 14.070366] RIP: 0010:amdgpu_ras_feature_enable+0x1b4/0x210 [amdgpu]
[ 14.070368] Code: d9 63 59 00 01 e8 6c 88 50 ea 0f 0b 45 31 ff e9 79 ff ff ff 44 89 fe 48 c7 c7 80 c1 c2 c0 c6 05 b9 63 59 00 01 e8 4c 88 50 ea <0f> 0b 45 31 ff e9 ba fe ff ff 48 c7 c7 f8 c1 c2 c0 c6 05 9b 63 59
[ 14.070369] RSP: 0018:ffffab01c0287bb8 EFLAGS: 00010286
[ 14.070370] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000001f2a
[ 14.070371] RDX: 0000000000000001 RSI: 0000000000000082 RDI: 0000000000000247
[ 14.070371] RBP: ffffab01c0287be8 R08: 0000000000001f2a R09: 0000000000000004
[ 14.070372] R10: 0000000000000000 R11: 0000000000000001 R12: ffff94709362c400
[ 14.070372] R13: ffff9470801e0000 R14: ffffffffc0ccda20 R15: 000000000000000f
[ 14.070373] FS: 0000000000000000(0000) GS:ffff94710cc00000(0000) knlGS:0000000000000000
[ 14.070373] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 14.070374] CR2: 0000000000000000 CR3: 0000007d9c5a0005 CR4: 0000000000760ef0
[ 14.070374] PKRU: 55555554
[ 14.070374] Call Trace:
[ 14.070431] amdgpu_ras_feature_enable_on_boot+0x48/0xd0 [amdgpu]
[ 14.070489] ? sdma_v4_0_set_ecc_irq_state+0x61/0x70 [amdgpu]
[ 14.070537] amdgpu_ras_block_late_init+0x5c/0x1f0 [amdgpu]
[ 14.070592] ? amdgpu_irq_update+0x85/0xa0 [amdgpu]
[ 14.070640] ? amdgpu_irq_get+0x44/0x60 [amdgpu]
[ 14.070691] ? amdgpu_sdma_ras_late_init+0x7b/0xa0 [amdgpu]
[ 14.070739] amdgpu_ras_late_init+0x34/0x90 [amdgpu]
[ 14.070787] amdgpu_device_ip_late_init+0x7d/0x270 [amdgpu]
[ 14.070867] amdgpu_device_init.cold+0x16a3/0x1ea9 [amdgpu]
[ 14.070873] ? pci_read_config_word+0x27/0x40
[ 14.070922] amdgpu_driver_load_kms+0x1a/0x150 [amdgpu]
[ 14.070970] amdgpu_pci_probe+0x1ed/0x3f0 [amdgpu]
[ 14.070975] local_pci_probe+0x48/0x80
[ 14.070976] work_for_cpu_fn+0x1a/0x30
[ 14.070978] process_one_work+0x1eb/0x3b0
[ 14.070979] worker_thread+0x21e/0x400
[ 14.070981] kthread+0x104/0x140
[ 14.070982] ? process_one_work+0x3b0/0x3b0
[ 14.070983] ? kthread_park+0x90/0x90
[ 14.070989] ret_from_fork+0x22/0x40
[ 14.070990] ---[ end trace 7be76cc2cca5f417 ]---
@sandrain Apologies for the lack of response. Please check if your issue still exists with the latest ROCm 6.2. If not, please close the ticket. Thanks!