rocminfo broken inside container?
Not sure if this is a known feature/bug:
$ sudo docker run -it --device=/dev/kfd --device=/dev/dri --group-add video rocm/rocm-terminal
To run a command as administrator (user "root"), use "sudo <command>".
See "man sudo_root" for details.
rocm-user@98f7ad00522f:~$ rocminfo
hsa api call failure at line 900, file: /home/jenkins/jenkins-root/workspace/compute-rocm-rel-1.9/rocminfo/rocminfo.cc. Call returned 4104
rocm-user@98f7ad00522f:~$ rocm
rocm-smi rocm_agent_enumerator rocm_smi.py rocminfo
rocm-user@98f7ad00522f:~$ rocm-smi
==================== ROCm System Management Interface ====================
================================================================================
GPU Temp AvgPwr SCLK MCLK Fan Perf SCLK OD MCLK OD
0 N/A N/A N/A N/A 0% N/A N/A N/A
1 26c N/A 852Mhz 167Mhz 0.0% auto 0% 0%
2 23c N/A 852Mhz 167Mhz 0.0% auto 0% 0%
3 23c N/A 852Mhz 167Mhz 0.0% auto 0% 0%
4 27c N/A 852Mhz 167Mhz 0.0% auto 0% 0%
================================================================================
==================== End of ROCm SMI Log ====================
I am running on
$ lsb_release -d
Description: Ubuntu 16.04.4 LTS
with
$ docker --version
Docker version 18.03.1-ce, build 9ee9f40
$ sudo docker images |grep rocm-term
rocm/rocm-terminal latest 1c2cc81e67e0 5 days ago 1.83GB
rocm/rocm-terminal 1.9.1 4cceff492469 2 months ago 1.83GB
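A note on the "Call returned 4104" part of the message: rocminfo prints the raw HSA status code in decimal, and 4104 is 0x1008. As far as I can tell from hsa.h that is HSA_STATUS_ERROR_OUT_OF_RESOURCES, but please check it against the header shipped with your ROCm install. A minimal sketch for decoding such codes; the helper name hsa_status_name is made up for illustration, and the table only covers a handful of values:

```shell
#!/bin/sh
# Translate a decimal HSA status code (as printed by rocminfo) into its enum name.
# Only a few values from hsa.h are listed here; this is an assumption-laden subset,
# anything else is reported as unknown.
hsa_status_name() {
    hex=$(printf '0x%X' "$1")
    case "$hex" in
        0x0)    echo "HSA_STATUS_SUCCESS" ;;
        0x1000) echo "HSA_STATUS_ERROR" ;;
        0x1001) echo "HSA_STATUS_ERROR_INVALID_ARGUMENT" ;;
        0x1008) echo "HSA_STATUS_ERROR_OUT_OF_RESOURCES" ;;
        0x100B) echo "HSA_STATUS_ERROR_NOT_INITIALIZED" ;;
        *)      echo "unknown ($hex)" ;;
    esac
}

hsa_status_name 4104
```

The name alone does not say which resource is missing, so it is only a starting point for narrowing things down.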
Similar issue here:
rocm-user@ea6068792841:~$ rocminfo
hsa api call failure at line 900, file: /data/jenkins_workspace/compute-rocm-rel-2.1/rocminfo/rocminfo.cc. Call returned 4104
rocm-user@ea6068792841:~$ rocm-smi
======================== ROCm System Management Interface ========================
================================================================================================
GPU Temp AvgPwr SCLK MCLK PCLK Fan Perf PwrCap SCLK OD MCLK OD GPU%
0 50.0c N/A 852Mhz 167Mhz 8.0GB, x16 16.86% auto N/A 0% 0% N/A
================================================================================================
======================== End of ROCm SMI Log ========================
rocm-user@ea6068792841:~$ lspci -n -d 1002:
08:00.0 0300: 1002:687f (rev c3)
08:00.1 0403: 1002:aaf8
System information:
➜ lsb_release -d
Description: Ubuntu 18.04.1 LTS
➜ docker --version
Docker version 18.09.1, build 4c52b90
➜ docker images |grep rocm-term
rocm/rocm-terminal latest 1fe2ba083882 2 days ago 1.91GB
Which rock-dkms driver version is installed on your bare-metal setup? Please provide the output of the following commands from the bare-metal host:
uname -a
apt --installed list | grep rock-dkms
dmesg | grep amdgpu
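If it helps, the three host-side commands can be gathered in one go. This is only a convenience sketch: the function names run_section and collect_host_info are hypothetical, it uses the documented `apt list --installed` spelling, and a command that fails (for example dmesg without permission) is reported rather than aborting the run.

```shell
#!/bin/sh
# Run one diagnostic command and print its output under a header.
# Failures (missing tool, no permission, no grep match) are reported, not fatal.
run_section() {
    echo "== $1 =="
    sh -c "$1" 2>/dev/null || echo "(no output or command failed)"
    echo
}

# Collect the kernel version, the rock-dkms package entry and the amdgpu boot log.
collect_host_info() {
    run_section "uname -a"
    run_section "apt list --installed | grep rock-dkms"
    run_section "dmesg | grep amdgpu"
}

collect_host_info
```

Run it on the host, not in the container, and redirect the output to a file to attach it here.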
➜ uname -a
Linux gamma 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
➜ apt --installed list | grep rock-dkms
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
rock-dkms/Ubuntu 16.04,now 2.1-96 all [installed,automatic]
➜ dmesg | grep amdgpu
[ 1.676268] [drm] amdgpu kernel modesetting enabled.
[ 1.676268] [drm] amdgpu version: 19.10.7.418
[ 1.677432] fb: switching to amdgpudrmfb from EFI VGA
[ 1.683245] amdgpu 0000:08:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[ 1.683288] amdgpu 0000:08:00.0: VRAM: 8176M 0x000000F400000000 - 0x000000F5FEFFFFFF (8176M used)
[ 1.683291] amdgpu 0000:08:00.0: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 1.683293] amdgpu 0000:08:00.0: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
[ 1.683370] [drm] amdgpu: 8176M of VRAM memory ready
[ 1.683372] [drm] amdgpu: 8176M of GTT memory ready.
[ 2.164082] amdgpu 0000:08:00.0: ring gfx uses VM inv eng 0 on hub 0
[ 2.164085] amdgpu 0000:08:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 2.164087] amdgpu 0000:08:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 2.164089] amdgpu 0000:08:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[ 2.164091] amdgpu 0000:08:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[ 2.164093] amdgpu 0000:08:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ 2.164095] amdgpu 0000:08:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ 2.164097] amdgpu 0000:08:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[ 2.164099] amdgpu 0000:08:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[ 2.164101] amdgpu 0000:08:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[ 2.164103] amdgpu 0000:08:00.0: ring sdma0 uses VM inv eng 0 on hub 1
[ 2.164105] amdgpu 0000:08:00.0: ring page0 uses VM inv eng 1 on hub 1
[ 2.164106] amdgpu 0000:08:00.0: ring sdma1 uses VM inv eng 4 on hub 1
[ 2.164108] amdgpu 0000:08:00.0: ring page1 uses VM inv eng 5 on hub 1
[ 2.164110] amdgpu 0000:08:00.0: ring uvd_0 uses VM inv eng 6 on hub 1
[ 2.164112] amdgpu 0000:08:00.0: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
[ 2.164114] amdgpu 0000:08:00.0: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
[ 2.164116] amdgpu 0000:08:00.0: ring vce0 uses VM inv eng 9 on hub 1
[ 2.164118] amdgpu 0000:08:00.0: ring vce1 uses VM inv eng 10 on hub 1
[ 2.164120] amdgpu 0000:08:00.0: ring vce2 uses VM inv eng 11 on hub 1
[ 2.164974] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:08:00.0 on minor 0
@jiachengpan Thanks, those all look good.
Could you try to use sudo when querying rocminfo?
sudo /opt/rocm/bin/rocminfo
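If sudo makes a difference, the usual suspect is access to the device nodes passed in with --device. A rough way to check that from inside the container, assuming the nodes show up as /dev/kfd and /dev/dri/card* or /dev/dri/renderD* (check_device is a made-up helper name):

```shell
#!/bin/sh
# Report whether the current user can read and write a device node.
check_device() {
    if [ ! -e "$1" ]; then
        echo "$1: missing"
    elif [ -r "$1" ] && [ -w "$1" ]; then
        echo "$1: accessible"
    else
        # id -un can fail in a container without a passwd entry; fall back to the uid.
        echo "$1: present but not readable/writable by $(id -un 2>/dev/null || id -u)"
    fi
}

check_device /dev/kfd
for node in /dev/dri/card* /dev/dri/renderD*; do
    check_device "$node"
done
id   # lists the groups (for example video) of the current user
```

Compare the group shown by `ls -l /dev/kfd` with the groups printed by `id`. If they do not match, the --group-add option on the docker run line is the first thing to look at.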
Thanks! Actually, it looks like rocminfo works fine in Docker now, with or without sudo. Perhaps I upgraded the ROCm packages yesterday but never rebooted the machine...
I have the same problem. I was following the installation instructions for the PyTorch Docker setup.
I get this error when I run sudo /opt/rocm/bin/rocminfo:
hsa api call failure at line 900, file: /data/jenkins_workspace/compute-rocm-rel-2.4/rocminfo/rocminfo.cc. Call returned 4104
I'm running Ubuntu 18.04.2, and my ROCm configuration is as follows:
apt --installed list | grep rock-dkms
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.
rock-dkms/Ubuntu 16.04,now 2.4-25 all [installed]
Did you try rebooting your machine? Mine was resolved after a reboot...
Yes @jiachengpan, I tried rebooting.
lsmod | grep kfd is not printing anything. Could this be the cause?
kfd is no longer a separate module; it's part of amdgpu now. lsmod | grep amdgpu should return something.
This is the output I get from lsmod | grep amdgpu:
amdgpu 3506176 1
amdttm 98304 1 amdgpu
amd_sched 28672 1 amdgpu
amdkcl 24576 3 amd_sched,amdttm,amdgpu
amd_iommu_v2 20480 1 amdgpu
drm_kms_helper 167936 2 amdgpu,i915
drm 401408 22 drm_kms_helper,amd_sched,amdttm,amdgpu,i915,amdkcl
i2c_algo_bit 16384 2 amdgpu,i915
It seems lsmod is missing inside the container:
rocm-user@1ddd69214eda:~$ sudo /opt/rocm/bin/rocminfo
sh: 1: lsmod: not found
ROCk module is NOT loaded, possibly no GPU devices
Failed to get user name to check for video group membership
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
==========
HSA Agents
==========
*******
Agent 1
*******
Name: AMD Ryzen 5 3500U with Radeon Vega Mobile Gfx
Marketing Name: AMD Ryzen 5 3500U with Radeon Vega Mobile Gfx
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32(0x20) KB
Chip ID: 5592(0x15d8)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2100
BDFID: 768
Internal Node ID: 0
Compute Unit: 8
SIMDs per CU: 4
Shader Engines: 1
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 8388224(0x7ffe80) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: TRUE
ISA Info:
N/A
*******
Agent 2
*******
Name: gfx902
Marketing Name: AMD Ryzen 5 3500U with Radeon Vega Mobile Gfx
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 4096(0x1000)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 0
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
Chip ID: 5592(0x15d8)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1200
BDFID: 768
Internal Node ID: 0
Compute Unit: 11
SIMDs per CU: 4
Shader Engines: 1
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: FALSE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 160(0xa0)
Max Work-item Per CU: 10240(0x2800)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Pool Info:
Pool 1
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Acessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx902+xnack
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
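Since lsmod is not installed in the image, the "ROCk module is NOT loaded" line above may just mean the check could not run. One way to look for the module without lsmod is to read /proc/modules directly, which is normally visible inside a container because it shares the host kernel. A small sketch (module_loaded is a hypothetical helper, and the optional second argument only exists so the function can be pointed at another file):

```shell
#!/bin/sh
# Check for a kernel module by name without lsmod.
# $1 is the module name; $2 is the module list to read (defaults to /proc/modules).
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

if module_loaded amdgpu; then
    echo "amdgpu is loaded"
else
    echo "amdgpu not found in /proc/modules"
fi
```

A driver built into the kernel does not appear in /proc/modules; in that case checking whether the /sys/module/amdgpu directory exists is the better test.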