concurrent invocations of ndctl can cause linux panic

Open etsaur4 opened this issue 6 years ago • 5 comments

Raising a discussion on the linux-nvdimm alias to be tracked as a github issue.

https://lists.01.org/pipermail/linux-nvdimm/2019-May/021385.html

The problem still exists in 5.2 RC2

The problem is fairly easy to reproduce, in as little as 10 minutes. Run the following in parallel, e.g. in separate terminals.

In terminals #1, #3, #5, type:

    while [ 1 ]; do
        ndctl create-namespace -m devdax -s 48G
    done

In terminals #2, #4, #6, type:

    while [ 1 ]; do
        ndctl destroy-namespace all -f
    done
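For convenience, the six-terminal reproduction can probably be approximated from a single shell with background jobs. This is only a sketch, not something from the original report: the `NDCTL`, `WORKERS` and `ROUNDS` variables and the `run_loop`/`stress` helpers are hypothetical names introduced here, while the two ndctl command lines are taken verbatim from the steps above. Note that it destroys all namespaces, just like the original loops.

```shell
#!/bin/bash
# Sketch: run the create/destroy loops from the report as background jobs.
# NDCTL, WORKERS and ROUNDS are hypothetical knobs, not part of the report.
NDCTL="${NDCTL:-ndctl}"      # command to invoke
WORKERS="${WORKERS:-3}"      # number of create/destroy loop pairs
ROUNDS="${ROUNDS:-0}"        # iterations per loop; 0 means loop forever

# Repeat the given command ROUNDS times (forever if ROUNDS is 0),
# ignoring individual failures, as the original while loops do.
run_loop() {
    local n=0
    while [ "$ROUNDS" -eq 0 ] || [ "$n" -lt "$ROUNDS" ]; do
        "$@" || true
        n=$((n + 1))
    done
}

# Start WORKERS create loops and WORKERS destroy loops concurrently,
# then wait for all of them to finish.
stress() {
    local i
    for i in $(seq 1 "$WORKERS"); do
        run_loop "$NDCTL" create-namespace -m devdax -s 48G &
        run_loop "$NDCTL" destroy-namespace all -f &
    done
    wait
}

# Nothing runs unless the script is invoked with the argument "run".
if [ "${1:-}" = "run" ]; then
    stress
fi
```

Saved under any file name and invoked as root with the single argument `run`, this should behave like terminals #1 through #6 above. Setting `NDCTL=echo ROUNDS=2` gives a harmless dry run that only prints the command lines.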

Even a simple invocation will eventually lead to a panic, though it can take hours. For example, in terminal #1 run the script:

    #!/bin/bash
    while /bin/true
    do
        ndctl destroy-namespace -f all
        date
        for R in $(ndctl list -R | jq -r ".[] | .dev")
        do
            for i in {1..10}
            do
                ndctl create-namespace -r $R -s 8g -m devdax
            done
        done
    done

In terminal #2, type:

    while /bin/true; do
        ndctl list
    done

Running that same terminal #1 script in 2 separate terminals, thereby creating 2 separate threads that destroy and create namespaces, will usually result in a panic within an hour.

etsaur4 avatar Jun 06 '19 21:06 etsaur4

Update: 5.2 RC2 plus patches like the one for issue 91 also exhibits the problem. Same stack as the one in the nvdimm alias.

[ 376.581650] CPU: 20 PID: 1950 Comm: kworker/u130:14 Not tainted 4.14.35-1923.el7uek.x86_64 #2
[ 376.591165] Hardware name: Oracle Corporation ORACLE SERVER X8-2/ASM, MB, X7-2, BIOS 51020101 05/07/2019
[ 376.601755] Workqueue: events_unbound async_run_entry_fn
[ 376.607683] task: ffff9e78fa63bd80 task.stack: ffffc2348fb74000
[ 376.614292] RIP: 0010:kernfs_find_ns+0x18/0xbf
[ 376.619250] RSP: 0018:ffffc2348fb77d20 EFLAGS: 00010246
[ 376.625081] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000ffffffff
[ 376.633045] RDX: 0000000000000000 RSI: ffffffffa8eb5ac1 RDI: 0000000000000000
[ 376.641010] RBP: ffffc2348fb77d40 R08: 0000000000000000 R09: ffff9e61f9f48000
[ 376.648973] R10: 000000000000005c R11: 00000000000000a6 R12: ffffffffa8eb5ac1
[ 376.656938] R13: 0000000000000000 R14: ffffffffa8eb5ac1 R15: ffff9e7905fad208
[ 376.664902] FS: 0000000000000000(0000) GS:ffff9e791ef00000(0000) knlGS:0000000000000000
[ 376.673933] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 376.680347] CR2: 0000000000000070 CR3: 000000156b40a002 CR4: 00000000007606e0
[ 376.688311] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 376.696273] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 376.704238] PKRU: 55555554
[ 376.707255] Call Trace:
[ 376.709987] kernfs_find_and_get_ns+0x31/0x52
[ 376.714848] sysfs_unmerge_group+0x1d/0x57
[ 376.719422] dpm_sysfs_remove+0x22/0x5c
[ 376.723706] device_del+0x5a/0x325
[ 376.727502] device_unregister+0x1a/0x58
[ 376.731886] nd_async_device_unregister+0x22/0x30 [libnvdimm]
[ 376.738299] async_run_entry_fn+0x3e/0x169
[ 376.742870] process_one_work+0x169/0x3a6
[ 376.747345] worker_thread+0x4d/0x3e5
[ 376.751434] kthread+0x105/0x138
[ 376.755035] ? rescuer_thread+0x380/0x375
[ 376.759510] ? kthread_bind+0x20/0x15
[ 376.763600] ret_from_fork+0x24/0x49
[ 376.767588] Code: 24 08 48 83 42 40 01 5b 41 5c 5d c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 56 49 89 f6 41 55 49 89 d5 31 d2 41 54 53 <0f> b7 47 70 48 8b 5f 48 66 c1 e8 05 83 e0 01 4d 85 ed 0f b6 c8
[ 376.788686] RIP: kernfs_find_ns+0x18/0xbf RSP: ffffc2348fb77d20
[ 376.795293] CR2: 0000000000000070

etsaur4 avatar Jun 07 '19 22:06 etsaur4

I'm able to readily reproduce this. Concurrent ndctl invocations seem to be triggering a double free (double device-unregistration events). Still looking to narrow down all the scenarios where double unregistration occurs.

djbw avatar Jun 11 '19 05:06 djbw

Proposed fixes here: https://lists.01.org/pipermail/linux-nvdimm/2019-June/021847.html

djbw avatar Jun 11 '19 23:06 djbw

Also pushed out to libnvdimm-pending: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=libnvdimm-pending

djbw avatar Jun 11 '19 23:06 djbw

This should be fixed with recent Linux versions (such as 5.15)

hramrach avatar Jan 14 '22 14:01 hramrach