nvidia_p2p_get_pages(): Fix double-free in register-callback error path
A double-free in the rm_p2p_register_callback() error path of nv_p2p_get_pages() corrupts memory and leads to a kernel panic.
Fix this by adding a separate goto label for this error path, so that it skips freeing memory that has already been freed.
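For readers unfamiliar with the pattern, here is a minimal user-space sketch of the idea. It is not the driver code: `struct page_table`, `release_entries()`, `register_callback()`, `get_pages()` and the `free_calls` counter are all hypothetical stand-ins. The sketch assumes that the registration step has already released the buffer by the time it reports failure, which is the situation described above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these names come from the NVIDIA driver. */
struct page_table {
    int *entries;
};

static int free_calls; /* counts releases so the behaviour can be checked */

static void release_entries(struct page_table *pt)
{
    /* A second release of the same buffer would trip this assertion. */
    assert(pt->entries != NULL);
    free(pt->entries);
    pt->entries = NULL;
    free_calls++;
}

/* Models a registration step that releases entries itself when it fails. */
static int register_callback(struct page_table *pt, int fail)
{
    if (fail) {
        release_entries(pt);
        return -1;
    }
    return 0;
}

static int get_pages(int fail_map, int fail_register)
{
    struct page_table *pt = calloc(1, sizeof(*pt));
    int rc = -1;

    if (pt == NULL)
        return -1;

    pt->entries = calloc(4, sizeof(*pt->entries));
    if (pt->entries == NULL)
        goto free_table;

    if (fail_map)
        goto free_entries; /* we still own entries, so release them */

    if (register_callback(pt, fail_register) != 0)
        goto free_table;   /* entries already released: skip free_entries */

    rc = 0;
    /* Toy simplification: release everything on success too. */
free_entries:
    release_entries(pt);
free_table:
    free(pt);
    return rc;
}
```

In this sketch, calling `get_pages(0, 1)` exercises the registration failure and releases the buffer exactly once. If that path jumped to `free_entries` instead, the assertion in `release_entries()` would fire, which is the user-space analogue of the double-free.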
The double-free can be reproduced by calling nvidia_p2p_get_pages() on one CPU while another CPU simultaneously frees the GPU virtual address range passed into nvidia_p2p_get_pages(). Reproducing it is timing dependent and may take multiple tries.
Booting with the 'slub_debug=FZ' kernel parameter exposes the double-free:
[ 239.115091] =============================================================================
[ 239.124659] BUG kmalloc-16 (Tainted: G OE ): Object already free
[ 239.133011] -----------------------------------------------------------------------------
[ 239.144491] Slab 0xfffffa8bc4434140 objects=85 used=82 fp=0xffff9a3dd0d05910 flags=0x17ffffc0000200(slab|node=0|zone=2|lastcpupid=0x1fffff)
[ 239.158997] Object 0xffff9a3dd0d05670 @offset=1648 fp=0x0000000000000000
[ 239.168766] Redzone ffff9a3dd0d05660: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb ................
[ 239.179633] Object ffff9a3dd0d05670: 10 00 00 00 00 00 00 00 e5 04 3f 13 96 18 8e 47 ..........?....G
[ 239.190641] Redzone ffff9a3dd0d05680: bb bb bb bb bb bb bb bb ........
[ 239.200739] Padding ffff9a3dd0d05688: 84 80 0e 00 00 00 00 00 ........
[ 239.210938] CPU: 0 PID: 3150 Comm: hfi-sdma-test Kdump: loaded Tainted: G OE 6.5.0-rc1+ #1
[ 239.221911] Hardware name: Intel Corporation S2600CWR/S2600CWR, BIOS SE5C610.86B.01.01.1029.090220201031 09/02/2020
[ 239.233948] Call Trace:
[ 239.236992] <TASK>
[ 239.239608] dump_stack_lvl+0x33/0x50
[ 239.244010] object_err+0x3a/0x80
[ 239.248014] free_debug_processing+0x265/0x360
[ 239.253392] ? nv_p2p_get_pages+0x163/0x590 [nvidia]
[ 239.259399] free_to_partial_list+0x80/0x280
[ 239.264478] ? nv_p2p_get_pages+0x163/0x590 [nvidia]
[ 239.270426] nv_p2p_get_pages+0x163/0x590 [nvidia]
[ 239.276303] ? __pfx_remove_nvidia_pages+0x10/0x10 [hfi1]
[ 239.282692] nvidia_p2p_get_pages+0x25/0x40 [nvidia]
[ 239.288601] ? __pfx_remove_nvidia_pages+0x10/0x10 [hfi1]
...
[ 239.498990] </TASK>
[ 239.501662] Disabling lock debugging due to kernel taint
[ 239.507828] FIX kmalloc-16: Object at 0xffff9a3dd0d05670 not freed
Thanks for identifying this and the proposed fix, @BrendanCunningham. A variation of this fix will be included in a future release. I'll leave this open until then. Thanks.
It looks like the bug that my PR fixed was fixed in commit be3cd9abcb1103115ae6c3c92d8fc4ff5c912f77.