rocm_smi_lib

RSMI_STATUS_PERMISSION on rocm-smi --setmclk

Open sandrain opened this issue 2 years ago • 7 comments

  • System: ubuntu-focal (5.4.0-109-generic)
  • rocm-5.2.1
  • GPU: MI250X

I am trying to set the memory clock frequency using rocm-smi, and it fails with the RSMI_STATUS_PERMISSION error. The performance level was set to manual:

$ rocm-smi --showhw


======================= ROCm System Management Interface =======================
============================ Concise Hardware Info =============================
GPU  DID   GFX RAS  SDMA RAS  UMC RAS  VBIOS           BUS
0    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:31:00.0
1    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:34:00.0
2    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:11:00.0
3    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:14:00.0
4    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:AE:00.0
5    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:B3:00.0
6    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:8E:00.0
7    740c  ENABLED  ENABLED   ENABLED  113-D65210-063  0000:93:00.0
================================================================================
============================= End of ROCm SMI Log ==============================
$ sudo rocm-smi -d 0 --showclkfrq --showperflevel


======================= ROCm System Management Interface =======================
============================ Show Performance Level ============================
GPU[0]          : Performance Level: manual
================================================================================
========================= Supported clock frequencies ==========================
GPU[0]          :
GPU[0]          : Supported fclk frequencies on GPU0
GPU[0]          : 0: 0Mhz *
GPU[0]          :
GPU[0]          : Supported mclk frequencies on GPU0
GPU[0]          : 0: 400Mhz
GPU[0]          : 1: 700Mhz
GPU[0]          : 2: 1200Mhz
GPU[0]          : 3: 1600Mhz *
GPU[0]          :
GPU[0]          : Supported sclk frequencies on GPU0
GPU[0]          : 0: 500Mhz
GPU[0]          : 1: 1700Mhz *
GPU[0]          :
GPU[0]          : Supported socclk frequencies on GPU0
GPU[0]          : 0: 666Mhz
GPU[0]          : 1: 857Mhz
GPU[0]          : 2: 1000Mhz
GPU[0]          : 3: 1090Mhz *
GPU[0]          : 4: 1333Mhz
GPU[0]          :
--------------------------------------------------------------------------------
================================================================================
============================= End of ROCm SMI Log ==============================
$ sudo rocm-smi -d 0 --setmclk 2


======================= ROCm System Management Interface =======================
============================== Set mclk Frequency ==============================
ERROR: 4 GPU[0]:RSMI_STATUS_PERMISSION: The user ID of the calling process does not have sufficient permission to execute a command.  Often this is fixed by running as root (sudo).
ERROR: GPU[0]           : Unable to set mclk bitmask to: 0x4
================================================================================
============================= End of ROCm SMI Log ==============================
$ sudo rocm-smi -d 0 --setmclk 0


======================= ROCm System Management Interface =======================
============================== Set mclk Frequency ==============================
ERROR: 4 GPU[0]:RSMI_STATUS_PERMISSION: The user ID of the calling process does not have sufficient permission to execute a command.  Often this is fixed by running as root (sudo).
ERROR: GPU[0]           : Unable to set mclk bitmask to: 0x1
================================================================================
============================= End of ROCm SMI Log ==============================

I found that only sclk is configurable. Is this expected, or did I miss something? Thanks!
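The bitmask values in the two errors above are consistent with one bit per clock level: level 2 produced 0x4 and level 0 produced 0x1. A minimal Python sketch of that mapping follows. The helper name `levels_to_bitmask` is hypothetical and is not part of rocm-smi; it only illustrates the arithmetic.

```python
def levels_to_bitmask(levels):
    """Return an integer with bit N set for every requested clock level N.

    Hypothetical helper for illustration; not a rocm-smi function.
    """
    mask = 0
    for level in levels:
        if level < 0:
            raise ValueError(f"clock level must be non-negative, got {level}")
        mask |= 1 << level  # level N corresponds to bit N
    return mask


if __name__ == "__main__":
    print(hex(levels_to_bitmask([2])))  # 0x4, as in "--setmclk 2"
    print(hex(levels_to_bitmask([0])))  # 0x1, as in "--setmclk 0"
```

Under this reading, the tool computed the mask for the requested level correctly, so the failure happens when the mask is applied, not when it is built.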

sandrain avatar Jan 05 '23 14:01 sandrain

Did you set the feature mask and set the performance level to manual, like this?

$ rocm-smi --setperflevel manual
$ sudo rocm-smi --setvc 2 1701 915 --autorespond y
$ sudo rocm-smi --setsrange 808 1740 --autorespond y
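The two preconditions mentioned here can also be inspected directly from sysfs. A rough Python sketch, assuming the usual amdgpu locations (`power_dpm_force_performance_level` under the card's device directory and the `ppfeaturemask` module parameter); the card index `card0` and the helper names are assumptions for illustration and may not match a given system:

```python
from pathlib import Path

# Assumed locations; the card index (card0) differs between systems.
PERF_LEVEL_FILE = Path("/sys/class/drm/card0/device/power_dpm_force_performance_level")
FEATURE_MASK_FILE = Path("/sys/module/amdgpu/parameters/ppfeaturemask")


def is_manual(perf_level_text):
    """Return True if the performance-level file content reads 'manual'."""
    return perf_level_text.strip() == "manual"


def parse_ppfeaturemask(mask_text):
    """Parse the feature mask; base 0 accepts both '0x...' and decimal text."""
    return int(mask_text.strip(), 0)


if __name__ == "__main__":
    if PERF_LEVEL_FILE.exists():
        print("manual:", is_manual(PERF_LEVEL_FILE.read_text()))
    else:
        print(f"{PERF_LEVEL_FILE}: not present on this system")

    if FEATURE_MASK_FILE.exists():
        mask = parse_ppfeaturemask(FEATURE_MASK_FILE.read_text())
        print("ppfeaturemask:", hex(mask))
    else:
        print(f"{FEATURE_MASK_FILE}: not present on this system")
```

This only reports the current state; it does not change any setting and does not by itself explain why a particular clock domain rejects writes.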

rakataprime avatar Jan 19 '23 20:01 rakataprime

@rakataprime Thanks for your input. I've tried the feature mask, which I didn't set properly before. However, I still cannot change the memory clock frequency as I wish.

BTW, I found the following warning in the kernel log when the amdgpu module is loaded (regardless of the value of the ppfeaturemask kernel parameter):

[   14.070181] ------------[ cut here ]------------
[   14.070182] RAS ERROR: unexpected block id 15
[   14.070285] WARNING: CPU: 0 PID: 5 at /var/lib/dkms/amdgpu/5.16.9.22.20-1447096~20.04/build/amd/amdgpu/amdgpu_ras.h:579 amdgpu_ras_feature_enable+0x1b4/0x210 [amdgpu]
[   14.070285] Modules linked in: crc32_pclmul hid_generic ib_uverbs ib_core amdgpu(OE+) amd_iommu_v2 amdttm(OE) amd_sched(OE) amdkcl(OE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci drm usbhid nvme libahci i2c_algo_bit hid i40e nvme_core i2c_piix4 wmi
[   14.070299] CPU: 0 PID: 5 Comm: kworker/0:0 Tainted: G           OE     5.4.0-109-generic #123-Ubuntu
[   14.070300] Hardware name: Supermicro AS -4124GQ-TNMI/H12DGQ-NT6, BIOS 2.4 08/23/2022
[   14.070309] Workqueue: events work_for_cpu_fn
[   14.070366] RIP: 0010:amdgpu_ras_feature_enable+0x1b4/0x210 [amdgpu]
[   14.070368] Code: d9 63 59 00 01 e8 6c 88 50 ea 0f 0b 45 31 ff e9 79 ff ff ff 44 89 fe 48 c7 c7 80 c1 c2 c0 c6 05 b9 63 59 00 01 e8 4c 88 50 ea <0f> 0b 45 31 ff e9 ba fe ff ff 48 c7 c7 f8 c1 c2 c0 c6 05 9b 63 59
[   14.070369] RSP: 0018:ffffab01c0287bb8 EFLAGS: 00010286
[   14.070370] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000001f2a
[   14.070371] RDX: 0000000000000001 RSI: 0000000000000082 RDI: 0000000000000247
[   14.070371] RBP: ffffab01c0287be8 R08: 0000000000001f2a R09: 0000000000000004
[   14.070372] R10: 0000000000000000 R11: 0000000000000001 R12: ffff94709362c400
[   14.070372] R13: ffff9470801e0000 R14: ffffffffc0ccda20 R15: 000000000000000f
[   14.070373] FS:  0000000000000000(0000) GS:ffff94710cc00000(0000) knlGS:0000000000000000
[   14.070373] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   14.070374] CR2: 0000000000000000 CR3: 0000007d9c5a0005 CR4: 0000000000760ef0
[   14.070374] PKRU: 55555554
[   14.070374] Call Trace:
[   14.070431]  amdgpu_ras_feature_enable_on_boot+0x48/0xd0 [amdgpu]
[   14.070489]  ? sdma_v4_0_set_ecc_irq_state+0x61/0x70 [amdgpu]
[   14.070537]  amdgpu_ras_block_late_init+0x5c/0x1f0 [amdgpu]
[   14.070592]  ? amdgpu_irq_update+0x85/0xa0 [amdgpu]
[   14.070640]  ? amdgpu_irq_get+0x44/0x60 [amdgpu]
[   14.070691]  ? amdgpu_sdma_ras_late_init+0x7b/0xa0 [amdgpu]
[   14.070739]  amdgpu_ras_late_init+0x34/0x90 [amdgpu]
[   14.070787]  amdgpu_device_ip_late_init+0x7d/0x270 [amdgpu]
[   14.070867]  amdgpu_device_init.cold+0x16a3/0x1ea9 [amdgpu]
[   14.070873]  ? pci_read_config_word+0x27/0x40
[   14.070922]  amdgpu_driver_load_kms+0x1a/0x150 [amdgpu]
[   14.070970]  amdgpu_pci_probe+0x1ed/0x3f0 [amdgpu]
[   14.070975]  local_pci_probe+0x48/0x80
[   14.070976]  work_for_cpu_fn+0x1a/0x30
[   14.070978]  process_one_work+0x1eb/0x3b0
[   14.070979]  worker_thread+0x21e/0x400
[   14.070981]  kthread+0x104/0x140
[   14.070982]  ? process_one_work+0x3b0/0x3b0
[   14.070983]  ? kthread_park+0x90/0x90
[   14.070989]  ret_from_fork+0x22/0x40
[   14.070990] ---[ end trace 7be76cc2cca5f417 ]---

sandrain avatar Jan 20 '23 15:01 sandrain

@sandrain Apologies for the lack of response. Please check if your issue still exists with the latest ROCm 6.2. If not, please close the ticket. Thanks!

ppanchad-amd avatar Aug 06 '24 19:08 ppanchad-amd