open-gpu-kernel-modules

nvidia_p2p_get_pages(): Fix double-free in register-callback error path

Open BrendanCunningham opened this issue 2 years ago • 2 comments

A double-free in the rm_p2p_register_callback() error path of nv_p2p_get_pages() causes memory corruption that leads to a kernel panic.

Fix this by adding a separate goto for this error path that skips freeing the already-freed memory.
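
The shape of such a fix can be sketched in user-space C. This is not the driver code: get_pages_sketch(), register_callback_stub(), release_entries(), and the fail_at enum are hypothetical names made up for illustration, and the sketch assumes that the failing registration step has already released the buffer by the time it returns.

```c
#include <stdlib.h>

/* All names below are hypothetical stand-ins, not NVIDIA driver symbols. */
enum fail_at { FAIL_NONE, FAIL_SETUP, FAIL_REGISTER };

/* Counts releases of the entries buffer; a double-free would show up as 2. */
static int release_calls;

static void release_entries(int *entries)
{
    free(entries);
    release_calls++;
}

/* Assumption: when registration fails, this step has already released entries. */
static int register_callback_stub(int *entries, int should_fail)
{
    if (should_fail) {
        release_entries(entries);
        return -1;
    }
    return 0;
}

int get_pages_sketch(enum fail_at fail, int **out)
{
    int status;
    char *scratch;
    int *entries;

    scratch = malloc(16);
    if (scratch == NULL)
        return -1;

    entries = calloc(4, sizeof(*entries));
    if (entries == NULL) {
        free(scratch);
        return -1;
    }

    /* An earlier step fails: entries is still owned here, so free it. */
    status = (fail == FAIL_SETUP) ? -1 : 0;
    if (status != 0)
        goto failed;

    /* Registration fails: entries is already gone, so skip its free. */
    status = register_callback_stub(entries, fail == FAIL_REGISTER);
    if (status != 0)
        goto register_failed;

    free(scratch);
    *out = entries;   /* on success the caller owns entries */
    return 0;

failed:
    release_entries(entries);
register_failed:
    free(scratch);
    return status;
}
```

With FAIL_REGISTER, release_calls ends at 1. If the registration failure jumped to the failed label instead, the same buffer would be released twice, which is the pattern the slab debugger reports.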

The double-free can be reproduced by calling nvidia_p2p_get_pages() on one CPU while simultaneously freeing the GPU virtual address range passed into nvidia_p2p_get_pages() on another CPU. Reproducing it is timing dependent and may require multiple tries.

Booting with the 'slub_debug=FZ' kernel parameter shows the double-free:

[ 239.115091] =============================================================================
[ 239.124659] BUG kmalloc-16 (Tainted: G OE ): Object already free
[ 239.133011] -----------------------------------------------------------------------------

[ 239.144491] Slab 0xfffffa8bc4434140 objects=85 used=82 fp=0xffff9a3dd0d05910 flags=0x17ffffc0000200(slab|node=0|zone=2|lastcpupid=0x1fffff)
[ 239.158997] Object 0xffff9a3dd0d05670 @offset=1648 fp=0x0000000000000000

[ 239.168766] Redzone ffff9a3dd0d05660: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 239.179633] Object ffff9a3dd0d05670: 10 00 00 00 00 00 00 00 e5 04 3f 13 96 18 8e 47 ..........?....G
[ 239.190641] Redzone ffff9a3dd0d05680: bb bb bb bb bb bb bb bb ........
[ 239.200739] Padding ffff9a3dd0d05688: 84 80 0e 00 00 00 00 00 ........
[ 239.210938] CPU: 0 PID: 3150 Comm: hfi-sdma-test Kdump: loaded Tainted: G OE 6.5.0-rc1+ #1
[ 239.221911] Hardware name: Intel Corporation S2600CWR/S2600CWR, BIOS SE5C610.86B.01.01.1029.090220201031 09/02/2020
[ 239.233948] Call Trace:
[ 239.236992] <TASK>
[ 239.239608] dump_stack_lvl+0x33/0x50
[ 239.244010] object_err+0x3a/0x80
[ 239.248014] free_debug_processing+0x265/0x360
[ 239.253392] ? nv_p2p_get_pages+0x163/0x590 [nvidia]
[ 239.259399] free_to_partial_list+0x80/0x280
[ 239.264478] ? nv_p2p_get_pages+0x163/0x590 [nvidia]
[ 239.270426] nv_p2p_get_pages+0x163/0x590 [nvidia]
[ 239.276303] ? __pfx_remove_nvidia_pages+0x10/0x10 [hfi1]
[ 239.282692] nvidia_p2p_get_pages+0x25/0x40 [nvidia]
[ 239.288601] ? __pfx_remove_nvidia_pages+0x10/0x10 [hfi1]
...
[ 239.498990] </TASK>
[ 239.501662] Disabling lock debugging due to kernel taint
[ 239.507828] FIX kmalloc-16: Object at 0xffff9a3dd0d05670 not freed

BrendanCunningham, Sep 11 '23 14:09

Thanks for identifying this and the proposed fix, @BrendanCunningham. A variation of this fix will be included in a future release. I'll leave this open until then. Thanks.

aritger, Oct 03 '23 17:10

It looks like the bug that my PR fixed was fixed in commit be3cd9abcb1103115ae6c3c92d8fc4ff5c912f77.

BrendanCunningham, Sep 03 '25 13:09