[Issue]: kmemleak "unreferenced object...(size 32)" reports for memory allocated in amdgpu_vm_update_range()
Problem Description
After running osu_bibw -m 256:256 D D across two nodes, I see "unreferenced object" memory leak reports like so:
unreferenced object 0xffff940d27c6aae0 (size 32):
comm "osu_bibw", pid 8915, jiffies 4556610972 (age 18089.050s)
hex dump (first 32 bytes):
0f 00 00 00 0f 00 00 00 00 6e d8 00 00 00 00 00 .........n......
2d 75 84 b1 14 be d7 30 00 00 00 00 00 00 00 00 -u.....0........
backtrace:
[<ffffffffa3382c35>] kmalloc_trace+0x25/0x90
[<ffffffffc0aa96a7>] amdgpu_vm_update_range+0x97/0x890 [amdgpu]
[<ffffffffc0aaa7ce>] amdgpu_vm_clear_freed+0xde/0x250 [amdgpu]
[<ffffffffc0cf5da9>] amdgpu_amdkfd_gpuvm_unmap_memory_from_gpu+0x169/0x230 [amdgpu]
[<ffffffffc0cbc3fc>] kfd_ioctl_unmap_memory_from_gpu+0xec/0x310 [amdgpu]
[<ffffffffc0cba396>] kfd_ioctl+0x376/0x4d0 [amdgpu]
[<ffffffffa346fc1d>] __x64_sys_ioctl+0x8d/0xc0
[<ffffffffa3c8dcac>] do_syscall_64+0x5c/0x90
[<ffffffffa3e000a6>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
These appear in /sys/kernel/debug/kmemleak on both nodes: there are 2887 of these reports on one node and 2945 on the other.
On the node with 2887 "unreferenced object" reports, there are 2887 occurrences of amdgpu_vm_update_range in the kmemleak output.
On the node with 2945 reports, there are 2940 occurrences of amdgpu_vm_update_range; the remaining 5 reports trace through NFS code and are all 16 bytes in size (size 16).
All of the other "unreferenced object" reports on both nodes are 32 bytes in size (size 32).
I have not gone through every report, but given that the number of occurrences of amdgpu_vm_update_range matches the number of "unreferenced object...(size 32):" reports on both nodes, I strongly suspect that this is one repeating leak, or small variants thereof.
This is running osu_bibw D D with Open MPI on top of a driver for our HPC interconnect card with support for sending packets generated from ROCm buffers using a DMA engine. No calls from our driver (hfi1) appear in either kmemleak report. There are nearly as many (2881 and 2935) occurrences of kfd_ioctl+ in both kmemleak files as there are unreferenced object occurrences so I suspect that these leaks are occurring under an ioctl from ROCm userspace into amdgpu.
These leaks do not seem to affect the stability or functionality of the system, but I am only running short tests (one benchmark every few minutes to every few hours).
Operating System
Red Hat Enterprise Linux 9.4 (Plow)
CPU
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
GPU
AMD Instinct MI100
ROCm Version
ROCm 6.2.0
ROCm Component
No response
Steps to Reproduce
Hardware prerequisites:
- Two nodes equipped with:
- At least one MI100 each
- At least one Cornelis Networks Omni-Path 100 Host Fabric Adapter each
- Connected to each other either back-to-back or via an Omni-Path switch
Software prerequisites:
- Open MPI 5.0.5 built with ROCm 6.2.0 support
- libfabric with OPX provider with ROCm SDMA support
- hfi1 with AMD SDMA support (the hfi1 driver with AMD SDMA support can be found here)
Reproduction steps:
1. As root, `echo clear > /sys/kernel/debug/kmemleak` on both nodes.
2. Run `osu_bibw -m 256:256 D D` across the two nodes.
3. As root, `echo scan > /sys/kernel/debug/kmemleak` on both nodes.
4. As root, `cat /sys/kernel/debug/kmemleak > kmemleak-$(hostname)-256.txt`.
5. On each node, run `dmesg -wT` to monitor for when kmemleaks have been detected, with a message like:
   `[Tue Oct 1 17:09:58 2024] kmemleak: 2980 new suspected memory leaks (see /sys/kernel/debug/kmemleak)`
   This may take a few minutes.
6. Run `grep -Ec '^unreferenced object' kmemleak-*-256.txt`; make note of the number of hits from each file.
7. Run `grep -c amdgpu_vm_update_range kmemleak-*-256.txt`; note the number of hits from each file and compare to the hits from the same file in step 6.

The expectation is that the number of hits in step 7 will be the same as, or close to, the number of hits in step 6 for the same file.
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
ROCk module version 6.8.5 is loaded
=====================
HSA System Attributes
=====================
Runtime Version: 1.14
Runtime Ext Version: 1.6
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
==========
*******
Agent 1
*******
Name: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Uuid: CPU-XX
Marketing Name: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3600
BDFID: 0
Internal Node ID: 0
Compute Unit: 44
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 32287400(0x1ecaaa8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 32287400(0x1ecaaa8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 32287400(0x1ecaaa8) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 2
*******
Name: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Uuid: CPU-XX
Marketing Name: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 1
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 3600
BDFID: 0
Internal Node ID: 1
Compute Unit: 44
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Memory Properties:
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 33007252(0x1f7a694) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 33007252(0x1f7a694) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 33007252(0x1f7a694) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
*******
Agent 3
*******
Name: gfx908
Uuid: GPU-95386651081add54
Marketing Name: AMD Instinct MI100
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 2
Device Type: GPU
Cache Info:
L1: 16(0x10) KB
L2: 8192(0x2000) KB
Chip ID: 29580(0x738c)
ASIC Revision: 1(0x1)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 1502
BDFID: 1280
Internal Node ID: 2
Compute Unit: 120
SIMDs per CU: 4
Shader Engines: 8
Shader Arrs. per Eng.: 1
WatchPts on Addr. Ranges:4
Coherent Host Access: FALSE
Memory Properties:
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 64(0x40)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 40(0x28)
Max Work-item Per CU: 2560(0xa00)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 67
SDMA engine uCode:: 18
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 33538048(0x1ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS: EXTENDED FINE GRAINED
Size: 33538048(0x1ffc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Recommended Granule:2048KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Recommended Granule:0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
Additional Information
kmemleak reports
Two other things I didn't note in my initial report:
- I observed these problems on a 6.5.0 development kernel built with `CONFIG_DEBUG_KMEMLEAK=y`; the driver code I pointed you to is meant to be built against a distro kernel (e.g. 5.14 in RHEL 9.4). Distro kernels may not be built with `CONFIG_DEBUG_KMEMLEAK=y`.
- I first observed this problem a few weeks back. I narrowed the likely culprit to `tlb_cb = kmalloc(sizeof(*tlb_cb), GFP_KERNEL);` in `amdgpu_vm_update_range()` in `/usr/src/amdgpu-6.8.5-2009582.el9/` on my MI100 nodes. It looks like that struct is supposed to be freed in `amdgpu_vm_tlb_seq_cb()`. When I saw this problem, I noticed that both `amdgpu_vm_update_range` and `amdgpu_vm_tlb_seq_cb` are in /sys/kernel/tracing/available_filter_functions. So I did `echo function > /sys/kernel/tracing/current_tracer`, limited the function tracer to just `amdgpu_vm_update_range` and `amdgpu_vm_tlb_seq_cb`, and ran the reproducer. After running the reproducer, I saw many occurrences of `amdgpu_vm_update_range` but no occurrences of `amdgpu_vm_tlb_seq_cb` in /sys/kernel/tracing/trace on either node.
Hi @BrendanCunningham. An internal ticket has been created to investigate your issue. Thanks!
Hi @BrendanCunningham Thanks for reporting the issue! This is curious for sure, and we will try our best to reproduce it. Meanwhile, a speculation regarding your investigation:
It looks like that struct is supposed to be freed in amdgpu_vm_tlb_seq_cb(). When I saw this problem, I noticed that both amdgpu_vm_update_range and amdgpu_vm_tlb_seq_cb are in /sys/kernel/tracing/available_filter_functions.
So I did echo function > /sys/kernel/tracing/current_tracer, limited the function tracer to just amdgpu_vm_update_range and amdgpu_vm_tlb_seq_cb, and ran the reproducer. After running the reproducer, I saw many occurrences of amdgpu_vm_update_range but no occurrences of amdgpu_vm_tlb_seq_cb in /sys/kernel/tracing/trace for either node.
So how this works is that here `tlb_cb` is passed to `amdgpu_vm_tlb_flush`, then immediately set to NULL afterwards. However, `amdgpu_vm_tlb_flush` doesn't hold on to `tlb_cb` either; instead it only passes a reference to its member, `&tlb_cb->cb`, to the `amdgpu_vm_tlb_seq_cb` function, which gets executed only when the DMA fence gets signaled. This means that technically nothing is pointing to `tlb_cb` at the end of the scope of `amdgpu_vm_update_range`. I am not exactly sure how kmemleak keeps track of memory leaks, but if it does so by reference counting, then it is likely to raise a false alarm at that point. My suggestion would be to run a longer test and see if there's any actual memory consumption building up over time.
Hope this helps. Thanks!
Hi @BrendanCunningham, it seems we are unable to reproduce your issue at the moment. Have you had a chance to run a longer test? Thanks!
No, I haven't run a longer test yet.
Hi @BrendanCunningham, I will be closing this issue for now due to inactivity. Please feel free to reopen/post follow-ups whenever you are ready. Thanks!