dcache icon indicating copy to clipboard operation
dcache copied to clipboard

nfs hangs with mv or rm

Open calestyo opened this issue 3 years ago • 5 comments

Hey.

Since #4622 remains unfixed I wrote something that's manually creating a JSON, which is then uploaded via a cron job:

rm -f /pnfs/lrz-muenchen.de/data/atlas/meta/storage-descriptor.json.new
if [ $? -ne 0 ]; then
        printf 'Error: Could not remove old SRR data.\n' >&2
        exit 1
fi


srr_script /pnfs/lrz-muenchen.de/data/atlas/meta/storage-descriptor.json.new
if [ $? -ne 0 ]; then
        printf 'Error: Could not generate SRR data.\n' >&2
        exit 1
fi


mv -f /pnfs/lrz-muenchen.de/data/atlas/meta/storage-descriptor.json.new /pnfs/lrz-muenchen.de/data/atlas/meta/storage-descriptor.json
if [ $? -ne 0 ]; then
        printf 'Error: Could not replace previous SRR data.\n' >&2
        exit 1
fi

I've seen for both, the rm and the mv the process to hang, the nfs domain, when tried to be killed, also freezes and becomes a zombie process.

The kernel (5.10.92) gives:

Apr 11 15:16:48 lcg-lrz-dcache0 kernel: NFS:  2001:4ca0:0:201:a:0:2:35: Session trunking failed for 129.187.131.53
Apr 11 15:16:48 lcg-lrz-dcache0 kernel: RPC:   rpc_clnt_test_xprt failed: -111 addr ::1 not added
Apr 11 15:16:48 lcg-lrz-dcache0 kernel: RPC:   rpc_clnt_test_xprt failed: -111 addr 127.0.0.1 not added
Apr 11 15:21:01 lcg-lrz-dcache0 kernel: NFS:  2001:4ca0:0:201:a:0:2:35: Session trunking failed for 129.187.131.53
Apr 11 15:21:01 lcg-lrz-dcache0 kernel: RPC:   rpc_clnt_test_xprt failed: -111 addr ::1 not added
Apr 11 15:21:01 lcg-lrz-dcache0 kernel: RPC:   rpc_clnt_test_xprt failed: -111 addr 127.0.0.1 not added
Apr 11 15:25:19 lcg-lrz-dcache0 kernel: NFS:  2001:4ca0:0:201:a:0:2:35: Session trunking failed for 129.187.131.53
Apr 11 15:25:19 lcg-lrz-dcache0 kernel: RPC:   rpc_clnt_test_xprt failed: -111 addr ::1 not added
Apr 11 15:25:19 lcg-lrz-dcache0 kernel: RPC:   rpc_clnt_test_xprt failed: -111 addr 127.0.0.1 not added
Apr 11 15:25:26 lcg-lrz-dcache0 kernel: NFS:  2001:4ca0:0:201:a:0:2:35: Session trunking failed for 129.187.131.53
Apr 11 15:25:26 lcg-lrz-dcache0 kernel: RPC:   rpc_clnt_test_xprt failed: -111 addr ::1 not added
Apr 11 15:25:26 lcg-lrz-dcache0 kernel: RPC:   rpc_clnt_test_xprt failed: -111 addr 127.0.0.1 not added
Apr 11 15:25:42 lcg-lrz-dcache0 kernel: NFS:  2001:4ca0:0:201:a:0:2:12: Session trunking failed for 129.187.131.18
Apr 11 15:25:42 lcg-lrz-dcache0 kernel: RPC:   rpc_clnt_test_xprt failed: -111 addr ::1 not added
Apr 11 15:25:42 lcg-lrz-dcache0 kernel: RPC:   rpc_clnt_test_xprt failed: -111 addr 127.0.0.1 not added
Apr 11 15:29:51 lcg-lrz-dcache0 kernel: NFS:  2001:4ca0:0:201:a:0:2:12: Session trunking failed for 129.187.131.18
Apr 11 15:29:51 lcg-lrz-dcache0 kernel: RPC:   rpc_clnt_test_xprt failed: -111 addr ::1 not added
Apr 11 15:29:51 lcg-lrz-dcache0 kernel: RPC:   rpc_clnt_test_xprt failed: -111 addr 127.0.0.1 not added
Apr 11 15:33:04 lcg-lrz-dcache0 kernel: nfs: server localhost not responding, still trying
Apr 11 15:33:35 lcg-lrz-dcache0 kernel: INFO: task OncRpcSvc (34):32001 blocked for more than 120 seconds.
Apr 11 15:33:35 lcg-lrz-dcache0 kernel:       Not tainted 5.10.0-11-amd64 #1 Debian 5.10.92-1
Apr 11 15:33:35 lcg-lrz-dcache0 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 11 15:33:35 lcg-lrz-dcache0 kernel: task:OncRpcSvc (34)  state:D stack:    0 pid:32001 ppid:     1 flags:0x00004000
Apr 11 15:33:35 lcg-lrz-dcache0 kernel: Call Trace:
Apr 11 15:33:35 lcg-lrz-dcache0 kernel:  __schedule+0x282/0x870
Apr 11 15:33:35 lcg-lrz-dcache0 kernel:  schedule+0x46/0xb0
Apr 11 15:33:35 lcg-lrz-dcache0 kernel:  rwsem_down_read_slowpath+0x18e/0x500
Apr 11 15:33:35 lcg-lrz-dcache0 kernel:  ? lookup_fast+0xf2/0x150
Apr 11 15:33:35 lcg-lrz-dcache0 kernel:  walk_component+0x145/0x1d0
Apr 11 15:33:35 lcg-lrz-dcache0 kernel:  ? path_init+0x2bd/0x3e0
Apr 11 15:33:35 lcg-lrz-dcache0 kernel:  path_lookupat+0x73/0x1c0
Apr 11 15:33:35 lcg-lrz-dcache0 kernel:  filename_lookup+0xaa/0x1b0
Apr 11 15:33:35 lcg-lrz-dcache0 kernel:  ? strncpy_from_user+0x4e/0x140
Apr 11 15:33:35 lcg-lrz-dcache0 kernel:  ? getname_flags.part.0+0x45/0x1a0
Apr 11 15:33:35 lcg-lrz-dcache0 kernel:  vfs_statx+0x74/0x130
Apr 11 15:33:35 lcg-lrz-dcache0 kernel:  __do_sys_newlstat+0x39/0x70
Apr 11 15:33:35 lcg-lrz-dcache0 kernel:  ? exit_to_user_mode_prepare+0x6e/0x120
Apr 11 15:33:35 lcg-lrz-dcache0 kernel:  do_syscall_64+0x33/0x80
Apr 11 15:33:35 lcg-lrz-dcache0 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Apr 11 15:33:35 lcg-lrz-dcache0 kernel: RIP: 0033:0x7f44b221c446
Apr 11 15:33:35 lcg-lrz-dcache0 kernel: RSP: 002b:00007f42c31f0928 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Apr 11 15:33:35 lcg-lrz-dcache0 kernel: RAX: ffffffffffffffda RBX: 00007f42c31f2ab0 RCX: 00007f44b221c446
Apr 11 15:33:35 lcg-lrz-dcache0 kernel: RDX: 00007f42c31f0970 RSI: 00007f42c31f0970 RDI: 00007f42c31f1ab0
Apr 11 15:33:35 lcg-lrz-dcache0 kernel: RBP: 00007f42c31f0a40 R08: 0000000000000001 R09: 00007f4298054d46
Apr 11 15:33:35 lcg-lrz-dcache0 kernel: R10: 00000000005b8092 R11: 0000000000000246 R12: 00007f42c31f1ab0
Apr 11 15:33:35 lcg-lrz-dcache0 kernel: R13: 00007f4298054d61 R14: 00007f42c31f1af2 R15: 00007f4298054d62
Apr 11 15:35:35 lcg-lrz-dcache0 kernel: INFO: task OncRpcSvc (34):32001 blocked for more than 241 seconds.
Apr 11 15:35:35 lcg-lrz-dcache0 kernel:       Not tainted 5.10.0-11-amd64 #1 Debian 5.10.92-1
Apr 11 15:35:35 lcg-lrz-dcache0 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 11 15:35:35 lcg-lrz-dcache0 kernel: task:OncRpcSvc (34)  state:D stack:    0 pid:32001 ppid:     1 flags:0x00004000
Apr 11 15:35:35 lcg-lrz-dcache0 kernel: Call Trace:
Apr 11 15:35:35 lcg-lrz-dcache0 kernel:  __schedule+0x282/0x870
Apr 11 15:35:35 lcg-lrz-dcache0 kernel:  schedule+0x46/0xb0
Apr 11 15:35:35 lcg-lrz-dcache0 kernel:  rwsem_down_read_slowpath+0x18e/0x500
Apr 11 15:35:35 lcg-lrz-dcache0 kernel:  ? lookup_fast+0xf2/0x150
Apr 11 15:35:35 lcg-lrz-dcache0 kernel:  walk_component+0x145/0x1d0
Apr 11 15:35:35 lcg-lrz-dcache0 kernel:  ? path_init+0x2bd/0x3e0
Apr 11 15:35:35 lcg-lrz-dcache0 kernel:  path_lookupat+0x73/0x1c0
Apr 11 15:35:36 lcg-lrz-dcache0 kernel:  filename_lookup+0xaa/0x1b0
Apr 11 15:35:36 lcg-lrz-dcache0 kernel:  ? strncpy_from_user+0x4e/0x140
Apr 11 15:35:36 lcg-lrz-dcache0 kernel:  ? getname_flags.part.0+0x45/0x1a0
Apr 11 15:35:36 lcg-lrz-dcache0 kernel:  vfs_statx+0x74/0x130
Apr 11 15:35:36 lcg-lrz-dcache0 kernel:  __do_sys_newlstat+0x39/0x70
Apr 11 15:35:36 lcg-lrz-dcache0 kernel:  ? exit_to_user_mode_prepare+0x6e/0x120
Apr 11 15:35:36 lcg-lrz-dcache0 kernel:  do_syscall_64+0x33/0x80
Apr 11 15:35:36 lcg-lrz-dcache0 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Apr 11 15:35:36 lcg-lrz-dcache0 kernel: RIP: 0033:0x7f44b221c446
Apr 11 15:35:36 lcg-lrz-dcache0 kernel: RSP: 002b:00007f42c31f0928 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Apr 11 15:35:36 lcg-lrz-dcache0 kernel: RAX: ffffffffffffffda RBX: 00007f42c31f2ab0 RCX: 00007f44b221c446
Apr 11 15:35:36 lcg-lrz-dcache0 kernel: RDX: 00007f42c31f0970 RSI: 00007f42c31f0970 RDI: 00007f42c31f1ab0
Apr 11 15:35:36 lcg-lrz-dcache0 kernel: RBP: 00007f42c31f0a40 R08: 0000000000000001 R09: 00007f4298054d46
Apr 11 15:35:36 lcg-lrz-dcache0 kernel: R10: 00000000005b8092 R11: 0000000000000246 R12: 00007f42c31f1ab0
Apr 11 15:35:36 lcg-lrz-dcache0 kernel: R13: 00007f4298054d61 R14: 00007f42c31f1af2 R15: 00007f4298054d62
Apr 11 15:37:36 lcg-lrz-dcache0 kernel: INFO: task OncRpcSvc (34):32001 blocked for more than 362 seconds.
Apr 11 15:37:36 lcg-lrz-dcache0 kernel:       Not tainted 5.10.0-11-amd64 #1 Debian 5.10.92-1
Apr 11 15:37:36 lcg-lrz-dcache0 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 11 15:37:36 lcg-lrz-dcache0 kernel: task:OncRpcSvc (34)  state:D stack:    0 pid:32001 ppid:     1 flags:0x00004000
Apr 11 15:37:36 lcg-lrz-dcache0 kernel: Call Trace:
Apr 11 15:37:36 lcg-lrz-dcache0 kernel:  __schedule+0x282/0x870
Apr 11 15:37:36 lcg-lrz-dcache0 kernel:  schedule+0x46/0xb0
Apr 11 15:37:36 lcg-lrz-dcache0 kernel:  rwsem_down_read_slowpath+0x18e/0x500
Apr 11 15:37:36 lcg-lrz-dcache0 kernel:  ? lookup_fast+0xf2/0x150
Apr 11 15:37:36 lcg-lrz-dcache0 kernel:  walk_component+0x145/0x1d0
Apr 11 15:37:36 lcg-lrz-dcache0 kernel:  ? path_init+0x2bd/0x3e0
Apr 11 15:37:36 lcg-lrz-dcache0 kernel:  path_lookupat+0x73/0x1c0
Apr 11 15:37:36 lcg-lrz-dcache0 kernel:  filename_lookup+0xaa/0x1b0
Apr 11 15:37:36 lcg-lrz-dcache0 kernel:  ? strncpy_from_user+0x4e/0x140
Apr 11 15:37:36 lcg-lrz-dcache0 kernel:  ? getname_flags.part.0+0x45/0x1a0
Apr 11 15:37:36 lcg-lrz-dcache0 kernel:  vfs_statx+0x74/0x130
Apr 11 15:37:36 lcg-lrz-dcache0 kernel:  __do_sys_newlstat+0x39/0x70
Apr 11 15:37:36 lcg-lrz-dcache0 kernel:  ? exit_to_user_mode_prepare+0x6e/0x120
Apr 11 15:37:36 lcg-lrz-dcache0 kernel:  do_syscall_64+0x33/0x80
Apr 11 15:37:36 lcg-lrz-dcache0 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Apr 11 15:37:36 lcg-lrz-dcache0 kernel: RIP: 0033:0x7f44b221c446
Apr 11 15:37:36 lcg-lrz-dcache0 kernel: RSP: 002b:00007f42c31f0928 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Apr 11 15:37:36 lcg-lrz-dcache0 kernel: RAX: ffffffffffffffda RBX: 00007f42c31f2ab0 RCX: 00007f44b221c446
Apr 11 15:37:36 lcg-lrz-dcache0 kernel: RDX: 00007f42c31f0970 RSI: 00007f42c31f0970 RDI: 00007f42c31f1ab0
Apr 11 15:37:36 lcg-lrz-dcache0 kernel: RBP: 00007f42c31f0a40 R08: 0000000000000001 R09: 00007f4298054d46
Apr 11 15:37:36 lcg-lrz-dcache0 kernel: R10: 00000000005b8092 R11: 0000000000000246 R12: 00007f42c31f1ab0
Apr 11 15:37:36 lcg-lrz-dcache0 kernel: R13: 00007f4298054d61 R14: 00007f42c31f1af2 R15: 00007f4298054d62
Apr 11 15:39:37 lcg-lrz-dcache0 kernel: INFO: task OncRpcSvc (34):32001 blocked for more than 483 seconds.
Apr 11 15:39:37 lcg-lrz-dcache0 kernel:       Not tainted 5.10.0-11-amd64 #1 Debian 5.10.92-1
Apr 11 15:39:37 lcg-lrz-dcache0 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 11 15:39:37 lcg-lrz-dcache0 kernel: task:OncRpcSvc (34)  state:D stack:    0 pid:32001 ppid:     1 flags:0x00004000
Apr 11 15:39:37 lcg-lrz-dcache0 kernel: Call Trace:
Apr 11 15:39:37 lcg-lrz-dcache0 kernel:  __schedule+0x282/0x870
Apr 11 15:39:37 lcg-lrz-dcache0 kernel:  schedule+0x46/0xb0
Apr 11 15:39:37 lcg-lrz-dcache0 kernel:  rwsem_down_read_slowpath+0x18e/0x500
Apr 11 15:39:37 lcg-lrz-dcache0 kernel:  ? lookup_fast+0xf2/0x150
Apr 11 15:39:37 lcg-lrz-dcache0 kernel:  walk_component+0x145/0x1d0
Apr 11 15:39:37 lcg-lrz-dcache0 kernel:  ? path_init+0x2bd/0x3e0
Apr 11 15:39:37 lcg-lrz-dcache0 kernel:  path_lookupat+0x73/0x1c0
Apr 11 15:39:37 lcg-lrz-dcache0 kernel:  filename_lookup+0xaa/0x1b0
Apr 11 15:39:37 lcg-lrz-dcache0 kernel:  ? strncpy_from_user+0x4e/0x140
Apr 11 15:39:37 lcg-lrz-dcache0 kernel:  ? getname_flags.part.0+0x45/0x1a0
Apr 11 15:39:37 lcg-lrz-dcache0 kernel:  vfs_statx+0x74/0x130
Apr 11 15:39:37 lcg-lrz-dcache0 kernel:  __do_sys_newlstat+0x39/0x70
Apr 11 15:39:37 lcg-lrz-dcache0 kernel:  ? exit_to_user_mode_prepare+0x6e/0x120
Apr 11 15:39:37 lcg-lrz-dcache0 kernel:  do_syscall_64+0x33/0x80
Apr 11 15:39:37 lcg-lrz-dcache0 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Apr 11 15:39:37 lcg-lrz-dcache0 kernel: RIP: 0033:0x7f44b221c446
Apr 11 15:39:37 lcg-lrz-dcache0 kernel: RSP: 002b:00007f42c31f0928 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Apr 11 15:39:37 lcg-lrz-dcache0 kernel: RAX: ffffffffffffffda RBX: 00007f42c31f2ab0 RCX: 00007f44b221c446
Apr 11 15:39:37 lcg-lrz-dcache0 kernel: RDX: 00007f42c31f0970 RSI: 00007f42c31f0970 RDI: 00007f42c31f1ab0
Apr 11 15:39:37 lcg-lrz-dcache0 kernel: RBP: 00007f42c31f0a40 R08: 0000000000000001 R09: 00007f4298054d46
Apr 11 15:39:37 lcg-lrz-dcache0 kernel: R10: 00000000005b8092 R11: 0000000000000246 R12: 00007f42c31f1ab0
Apr 11 15:39:37 lcg-lrz-dcache0 kernel: R13: 00007f4298054d61 R14: 00007f42c31f1af2 R15: 00007f4298054d62
Apr 11 15:41:38 lcg-lrz-dcache0 kernel: INFO: task OncRpcSvc (34):32001 blocked for more than 604 seconds.
Apr 11 15:41:38 lcg-lrz-dcache0 kernel:       Not tainted 5.10.0-11-amd64 #1 Debian 5.10.92-1
Apr 11 15:41:38 lcg-lrz-dcache0 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 11 15:41:38 lcg-lrz-dcache0 kernel: task:OncRpcSvc (34)  state:D stack:    0 pid:32001 ppid:     1 flags:0x00004000
Apr 11 15:41:38 lcg-lrz-dcache0 kernel: Call Trace:
Apr 11 15:41:38 lcg-lrz-dcache0 kernel:  __schedule+0x282/0x870
Apr 11 15:41:38 lcg-lrz-dcache0 kernel:  schedule+0x46/0xb0
Apr 11 15:41:38 lcg-lrz-dcache0 kernel:  rwsem_down_read_slowpath+0x18e/0x500
Apr 11 15:41:38 lcg-lrz-dcache0 kernel:  ? lookup_fast+0xf2/0x150
Apr 11 15:41:38 lcg-lrz-dcache0 kernel:  walk_component+0x145/0x1d0
Apr 11 15:41:38 lcg-lrz-dcache0 kernel:  ? path_init+0x2bd/0x3e0
Apr 11 15:41:38 lcg-lrz-dcache0 kernel:  path_lookupat+0x73/0x1c0
Apr 11 15:41:38 lcg-lrz-dcache0 kernel:  filename_lookup+0xaa/0x1b0
Apr 11 15:41:38 lcg-lrz-dcache0 kernel:  ? strncpy_from_user+0x4e/0x140
Apr 11 15:41:38 lcg-lrz-dcache0 kernel:  ? getname_flags.part.0+0x45/0x1a0
Apr 11 15:41:38 lcg-lrz-dcache0 kernel:  vfs_statx+0x74/0x130
Apr 11 15:41:38 lcg-lrz-dcache0 kernel:  __do_sys_newlstat+0x39/0x70
Apr 11 15:41:38 lcg-lrz-dcache0 kernel:  ? exit_to_user_mode_prepare+0x6e/0x120
Apr 11 15:41:38 lcg-lrz-dcache0 kernel:  do_syscall_64+0x33/0x80
Apr 11 15:41:38 lcg-lrz-dcache0 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Apr 11 15:41:38 lcg-lrz-dcache0 kernel: RIP: 0033:0x7f44b221c446
Apr 11 15:41:38 lcg-lrz-dcache0 kernel: RSP: 002b:00007f42c31f0928 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Apr 11 15:41:38 lcg-lrz-dcache0 kernel: RAX: ffffffffffffffda RBX: 00007f42c31f2ab0 RCX: 00007f44b221c446
Apr 11 15:41:38 lcg-lrz-dcache0 kernel: RDX: 00007f42c31f0970 RSI: 00007f42c31f0970 RDI: 00007f42c31f1ab0
Apr 11 15:41:38 lcg-lrz-dcache0 kernel: RBP: 00007f42c31f0a40 R08: 0000000000000001 R09: 00007f4298054d46
Apr 11 15:41:38 lcg-lrz-dcache0 kernel: R10: 00000000005b8092 R11: 0000000000000246 R12: 00007f42c31f1ab0
Apr 11 15:41:38 lcg-lrz-dcache0 kernel: R13: 00007f4298054d61 R14: 00007f42c31f1af2 R15: 00007f4298054d62
Apr 11 15:43:39 lcg-lrz-dcache0 kernel: INFO: task OncRpcSvc (34):32001 blocked for more than 724 seconds.
Apr 11 15:43:39 lcg-lrz-dcache0 kernel:       Not tainted 5.10.0-11-amd64 #1 Debian 5.10.92-1
Apr 11 15:43:39 lcg-lrz-dcache0 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 11 15:43:39 lcg-lrz-dcache0 kernel: task:OncRpcSvc (34)  state:D stack:    0 pid:32001 ppid:     1 flags:0x00004000
Apr 11 15:43:39 lcg-lrz-dcache0 kernel: Call Trace:
Apr 11 15:43:39 lcg-lrz-dcache0 kernel:  __schedule+0x282/0x870
Apr 11 15:43:39 lcg-lrz-dcache0 kernel:  schedule+0x46/0xb0
Apr 11 15:43:39 lcg-lrz-dcache0 kernel:  rwsem_down_read_slowpath+0x18e/0x500
Apr 11 15:43:39 lcg-lrz-dcache0 kernel:  ? lookup_fast+0xf2/0x150
Apr 11 15:43:39 lcg-lrz-dcache0 kernel:  walk_component+0x145/0x1d0
Apr 11 15:43:39 lcg-lrz-dcache0 kernel:  ? path_init+0x2bd/0x3e0
Apr 11 15:43:39 lcg-lrz-dcache0 kernel:  path_lookupat+0x73/0x1c0
Apr 11 15:43:39 lcg-lrz-dcache0 kernel:  filename_lookup+0xaa/0x1b0
Apr 11 15:43:39 lcg-lrz-dcache0 kernel:  ? strncpy_from_user+0x4e/0x140
Apr 11 15:43:39 lcg-lrz-dcache0 kernel:  ? getname_flags.part.0+0x45/0x1a0
Apr 11 15:43:39 lcg-lrz-dcache0 kernel:  vfs_statx+0x74/0x130
Apr 11 15:43:39 lcg-lrz-dcache0 kernel:  __do_sys_newlstat+0x39/0x70
Apr 11 15:43:39 lcg-lrz-dcache0 kernel:  ? exit_to_user_mode_prepare+0x6e/0x120
Apr 11 15:43:39 lcg-lrz-dcache0 kernel:  do_syscall_64+0x33/0x80
Apr 11 15:43:39 lcg-lrz-dcache0 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Apr 11 15:43:39 lcg-lrz-dcache0 kernel: RIP: 0033:0x7f44b221c446
Apr 11 15:43:39 lcg-lrz-dcache0 kernel: RSP: 002b:00007f42c31f0928 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Apr 11 15:43:39 lcg-lrz-dcache0 kernel: RAX: ffffffffffffffda RBX: 00007f42c31f2ab0 RCX: 00007f44b221c446
Apr 11 15:43:39 lcg-lrz-dcache0 kernel: RDX: 00007f42c31f0970 RSI: 00007f42c31f0970 RDI: 00007f42c31f1ab0
Apr 11 15:43:39 lcg-lrz-dcache0 kernel: RBP: 00007f42c31f0a40 R08: 0000000000000001 R09: 00007f4298054d46
Apr 11 15:43:39 lcg-lrz-dcache0 kernel: R10: 00000000005b8092 R11: 0000000000000246 R12: 00007f42c31f1ab0
Apr 11 15:43:39 lcg-lrz-dcache0 kernel: R13: 00007f4298054d61 R14: 00007f42c31f1af2 R15: 00007f4298054d62
Apr 11 15:45:40 lcg-lrz-dcache0 kernel: INFO: task OncRpcSvc (34):32001 blocked for more than 845 seconds.
Apr 11 15:45:40 lcg-lrz-dcache0 kernel:       Not tainted 5.10.0-11-amd64 #1 Debian 5.10.92-1
Apr 11 15:45:40 lcg-lrz-dcache0 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 11 15:45:40 lcg-lrz-dcache0 kernel: task:OncRpcSvc (34)  state:D stack:    0 pid:32001 ppid:     1 flags:0x00004000
Apr 11 15:45:40 lcg-lrz-dcache0 kernel: Call Trace:
Apr 11 15:45:40 lcg-lrz-dcache0 kernel:  __schedule+0x282/0x870
Apr 11 15:45:40 lcg-lrz-dcache0 kernel:  schedule+0x46/0xb0
Apr 11 15:45:40 lcg-lrz-dcache0 kernel:  rwsem_down_read_slowpath+0x18e/0x500
Apr 11 15:45:40 lcg-lrz-dcache0 kernel:  ? lookup_fast+0xf2/0x150
Apr 11 15:45:40 lcg-lrz-dcache0 kernel:  walk_component+0x145/0x1d0
Apr 11 15:45:40 lcg-lrz-dcache0 kernel:  ? path_init+0x2bd/0x3e0
Apr 11 15:45:40 lcg-lrz-dcache0 kernel:  path_lookupat+0x73/0x1c0
Apr 11 15:45:40 lcg-lrz-dcache0 kernel:  filename_lookup+0xaa/0x1b0
Apr 11 15:45:40 lcg-lrz-dcache0 kernel:  ? strncpy_from_user+0x4e/0x140
Apr 11 15:45:40 lcg-lrz-dcache0 kernel:  ? getname_flags.part.0+0x45/0x1a0
Apr 11 15:45:40 lcg-lrz-dcache0 kernel:  vfs_statx+0x74/0x130
Apr 11 15:45:40 lcg-lrz-dcache0 kernel:  __do_sys_newlstat+0x39/0x70
Apr 11 15:45:40 lcg-lrz-dcache0 kernel:  ? exit_to_user_mode_prepare+0x6e/0x120
Apr 11 15:45:40 lcg-lrz-dcache0 kernel:  do_syscall_64+0x33/0x80
Apr 11 15:45:40 lcg-lrz-dcache0 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Apr 11 15:45:40 lcg-lrz-dcache0 kernel: RIP: 0033:0x7f44b221c446
Apr 11 15:45:40 lcg-lrz-dcache0 kernel: RSP: 002b:00007f42c31f0928 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Apr 11 15:45:40 lcg-lrz-dcache0 kernel: RAX: ffffffffffffffda RBX: 00007f42c31f2ab0 RCX: 00007f44b221c446
Apr 11 15:45:40 lcg-lrz-dcache0 kernel: RDX: 00007f42c31f0970 RSI: 00007f42c31f0970 RDI: 00007f42c31f1ab0
Apr 11 15:45:40 lcg-lrz-dcache0 kernel: RBP: 00007f42c31f0a40 R08: 0000000000000001 R09: 00007f4298054d46
Apr 11 15:45:40 lcg-lrz-dcache0 kernel: R10: 00000000005b8092 R11: 0000000000000246 R12: 00007f42c31f1ab0
Apr 11 15:45:40 lcg-lrz-dcache0 kernel: R13: 00007f4298054d61 R14: 00007f42c31f1af2 R15: 00007f4298054d62
Apr 11 15:47:40 lcg-lrz-dcache0 kernel: INFO: task OncRpcSvc (34):32001 blocked for more than 966 seconds.
Apr 11 15:47:40 lcg-lrz-dcache0 kernel:       Not tainted 5.10.0-11-amd64 #1 Debian 5.10.92-1
Apr 11 15:47:40 lcg-lrz-dcache0 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 11 15:47:40 lcg-lrz-dcache0 kernel: task:OncRpcSvc (34)  state:D stack:    0 pid:32001 ppid:     1 flags:0x00004000
Apr 11 15:47:40 lcg-lrz-dcache0 kernel: Call Trace:
Apr 11 15:47:40 lcg-lrz-dcache0 kernel:  __schedule+0x282/0x870
Apr 11 15:47:40 lcg-lrz-dcache0 kernel:  schedule+0x46/0xb0
Apr 11 15:47:40 lcg-lrz-dcache0 kernel:  rwsem_down_read_slowpath+0x18e/0x500
Apr 11 15:47:40 lcg-lrz-dcache0 kernel:  ? lookup_fast+0xf2/0x150
Apr 11 15:47:40 lcg-lrz-dcache0 kernel:  walk_component+0x145/0x1d0
Apr 11 15:47:40 lcg-lrz-dcache0 kernel:  ? path_init+0x2bd/0x3e0
Apr 11 15:47:40 lcg-lrz-dcache0 kernel:  path_lookupat+0x73/0x1c0
Apr 11 15:47:40 lcg-lrz-dcache0 kernel:  filename_lookup+0xaa/0x1b0
Apr 11 15:47:40 lcg-lrz-dcache0 kernel:  ? strncpy_from_user+0x4e/0x140
Apr 11 15:47:41 lcg-lrz-dcache0 kernel:  ? getname_flags.part.0+0x45/0x1a0
Apr 11 15:47:41 lcg-lrz-dcache0 kernel:  vfs_statx+0x74/0x130
Apr 11 15:47:41 lcg-lrz-dcache0 kernel:  __do_sys_newlstat+0x39/0x70
Apr 11 15:47:41 lcg-lrz-dcache0 kernel:  ? exit_to_user_mode_prepare+0x6e/0x120
Apr 11 15:47:41 lcg-lrz-dcache0 kernel:  do_syscall_64+0x33/0x80
Apr 11 15:47:41 lcg-lrz-dcache0 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Apr 11 15:47:41 lcg-lrz-dcache0 kernel: RIP: 0033:0x7f44b221c446
Apr 11 15:47:41 lcg-lrz-dcache0 kernel: RSP: 002b:00007f42c31f0928 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Apr 11 15:47:41 lcg-lrz-dcache0 kernel: RAX: ffffffffffffffda RBX: 00007f42c31f2ab0 RCX: 00007f44b221c446
Apr 11 15:47:41 lcg-lrz-dcache0 kernel: RDX: 00007f42c31f0970 RSI: 00007f42c31f0970 RDI: 00007f42c31f1ab0
Apr 11 15:47:41 lcg-lrz-dcache0 kernel: RBP: 00007f42c31f0a40 R08: 0000000000000001 R09: 00007f4298054d46
Apr 11 15:47:41 lcg-lrz-dcache0 kernel: R10: 00000000005b8092 R11: 0000000000000246 R12: 00007f42c31f1ab0
Apr 11 15:47:41 lcg-lrz-dcache0 kernel: R13: 00007f4298054d61 R14: 00007f42c31f1af2 R15: 00007f4298054d62
Apr 11 15:49:41 lcg-lrz-dcache0 kernel: INFO: task OncRpcSvc (34):32001 blocked for more than 1087 seconds.
Apr 11 15:49:41 lcg-lrz-dcache0 kernel:       Not tainted 5.10.0-11-amd64 #1 Debian 5.10.92-1
Apr 11 15:49:41 lcg-lrz-dcache0 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 11 15:49:41 lcg-lrz-dcache0 kernel: task:OncRpcSvc (34)  state:D stack:    0 pid:32001 ppid:     1 flags:0x00004000
Apr 11 15:49:41 lcg-lrz-dcache0 kernel: Call Trace:
Apr 11 15:49:41 lcg-lrz-dcache0 kernel:  __schedule+0x282/0x870
Apr 11 15:49:41 lcg-lrz-dcache0 kernel:  schedule+0x46/0xb0
Apr 11 15:49:41 lcg-lrz-dcache0 kernel:  rwsem_down_read_slowpath+0x18e/0x500
Apr 11 15:49:41 lcg-lrz-dcache0 kernel:  ? lookup_fast+0xf2/0x150
Apr 11 15:49:41 lcg-lrz-dcache0 kernel:  walk_component+0x145/0x1d0
Apr 11 15:49:41 lcg-lrz-dcache0 kernel:  ? path_init+0x2bd/0x3e0
Apr 11 15:49:41 lcg-lrz-dcache0 kernel:  path_lookupat+0x73/0x1c0
Apr 11 15:49:41 lcg-lrz-dcache0 kernel:  filename_lookup+0xaa/0x1b0
Apr 11 15:49:41 lcg-lrz-dcache0 kernel:  ? strncpy_from_user+0x4e/0x140
Apr 11 15:49:41 lcg-lrz-dcache0 kernel:  ? getname_flags.part.0+0x45/0x1a0
Apr 11 15:49:41 lcg-lrz-dcache0 kernel:  vfs_statx+0x74/0x130
Apr 11 15:49:41 lcg-lrz-dcache0 kernel:  __do_sys_newlstat+0x39/0x70
Apr 11 15:49:41 lcg-lrz-dcache0 kernel:  ? exit_to_user_mode_prepare+0x6e/0x120
Apr 11 15:49:41 lcg-lrz-dcache0 kernel:  do_syscall_64+0x33/0x80
Apr 11 15:49:41 lcg-lrz-dcache0 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Apr 11 15:49:41 lcg-lrz-dcache0 kernel: RIP: 0033:0x7f44b221c446
Apr 11 15:49:41 lcg-lrz-dcache0 kernel: RSP: 002b:00007f42c31f0928 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Apr 11 15:49:41 lcg-lrz-dcache0 kernel: RAX: ffffffffffffffda RBX: 00007f42c31f2ab0 RCX: 00007f44b221c446
Apr 11 15:49:41 lcg-lrz-dcache0 kernel: RDX: 00007f42c31f0970 RSI: 00007f42c31f0970 RDI: 00007f42c31f1ab0
Apr 11 15:49:41 lcg-lrz-dcache0 kernel: RBP: 00007f42c31f0a40 R08: 0000000000000001 R09: 00007f4298054d46
Apr 11 15:49:41 lcg-lrz-dcache0 kernel: R10: 00000000005b8092 R11: 0000000000000246 R12: 00007f42c31f1ab0
Apr 11 15:49:41 lcg-lrz-dcache0 kernel: R13: 00007f4298054d61 R14: 00007f42c31f1af2 R15: 00007f4298054d62
Apr 11 15:51:42 lcg-lrz-dcache0 kernel: INFO: task OncRpcSvc (34):32001 blocked for more than 1208 seconds.
Apr 11 15:51:42 lcg-lrz-dcache0 kernel:       Not tainted 5.10.0-11-amd64 #1 Debian 5.10.92-1
Apr 11 15:51:42 lcg-lrz-dcache0 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 11 15:51:42 lcg-lrz-dcache0 kernel: task:OncRpcSvc (34)  state:D stack:    0 pid:32001 ppid:     1 flags:0x00004004
Apr 11 15:51:42 lcg-lrz-dcache0 kernel: Call Trace:
Apr 11 15:51:42 lcg-lrz-dcache0 kernel:  __schedule+0x282/0x870
Apr 11 15:51:42 lcg-lrz-dcache0 kernel:  schedule+0x46/0xb0
Apr 11 15:51:42 lcg-lrz-dcache0 kernel:  rwsem_down_read_slowpath+0x18e/0x500
Apr 11 15:51:42 lcg-lrz-dcache0 kernel:  ? lookup_fast+0xf2/0x150
Apr 11 15:51:42 lcg-lrz-dcache0 kernel:  walk_component+0x145/0x1d0
Apr 11 15:51:42 lcg-lrz-dcache0 kernel:  ? path_init+0x2bd/0x3e0
Apr 11 15:51:42 lcg-lrz-dcache0 kernel:  path_lookupat+0x73/0x1c0
Apr 11 15:51:42 lcg-lrz-dcache0 kernel:  filename_lookup+0xaa/0x1b0
Apr 11 15:51:42 lcg-lrz-dcache0 kernel:  ? strncpy_from_user+0x4e/0x140
Apr 11 15:51:42 lcg-lrz-dcache0 kernel:  ? getname_flags.part.0+0x45/0x1a0
Apr 11 15:51:42 lcg-lrz-dcache0 kernel:  vfs_statx+0x74/0x130
Apr 11 15:51:42 lcg-lrz-dcache0 kernel:  __do_sys_newlstat+0x39/0x70
Apr 11 15:51:42 lcg-lrz-dcache0 kernel:  ? exit_to_user_mode_prepare+0x6e/0x120
Apr 11 15:51:42 lcg-lrz-dcache0 kernel:  do_syscall_64+0x33/0x80
Apr 11 15:51:42 lcg-lrz-dcache0 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Apr 11 15:51:42 lcg-lrz-dcache0 kernel: RIP: 0033:0x7f44b221c446
Apr 11 15:51:42 lcg-lrz-dcache0 kernel: RSP: 002b:00007f42c31f0928 EFLAGS: 00000246 ORIG_RAX: 0000000000000006
Apr 11 15:51:42 lcg-lrz-dcache0 kernel: RAX: ffffffffffffffda RBX: 00007f42c31f2ab0 RCX: 00007f44b221c446
Apr 11 15:51:42 lcg-lrz-dcache0 kernel: RDX: 00007f42c31f0970 RSI: 00007f42c31f0970 RDI: 00007f42c31f1ab0
Apr 11 15:51:42 lcg-lrz-dcache0 kernel: RBP: 00007f42c31f0a40 R08: 0000000000000001 R09: 00007f4298054d46
Apr 11 15:51:42 lcg-lrz-dcache0 kernel: R10: 00000000005b8092 R11: 0000000000000246 R12: 00007f42c31f1ab0
Apr 11 15:51:42 lcg-lrz-dcache0 kernel: R13: 00007f4298054d61 R14: 00007f42c31f1af2 R15: 00007f4298054d62
Apr 11 15:53:52 lcg-lrz-dcache0 kernel: nfs: server localhost not responding, timed out
Apr 11 15:53:58 lcg-lrz-dcache0 kernel: nfs: server localhost OK

...killed the rm process here,... and stared the nfs domain again...

Apr 11 15:55:53 lcg-lrz-dcache0 kernel: NFS:  2001:4ca0:0:201:a:0:2:35: Session trunking failed for 129.187.131.53
Apr 11 15:55:53 lcg-lrz-dcache0 kernel: RPC:   rpc_clnt_test_xprt failed: -111 addr ::1 not added
Apr 11 15:55:53 lcg-lrz-dcache0 kernel: RPC:   rpc_clnt_test_xprt failed: -111 addr 127.0.0.1 not added
Apr 11 15:56:55 lcg-lrz-dcache0 kernel: NFS:  2001:4ca0:0:201:a:0:2:35: Session trunking failed for 129.187.131.53
Apr 11 15:56:55 lcg-lrz-dcache0 kernel: RPC:   rpc_clnt_test_xprt failed: -111 addr ::1 not added
Apr 11 15:56:55 lcg-lrz-dcache0 kernel: RPC:   rpc_clnt_test_xprt failed: -111 addr 127.0.0.1 not added
Apr 11 16:00:01 lcg-lrz-dcache0 kernel: NFS:  2001:4ca0:0:201:a:0:2:35: Session trunking failed for 129.187.131.53
Apr 11 16:00:01 lcg-lrz-dcache0 kernel: RPC:   rpc_clnt_test_xprt failed: -111 addr ::1 not added
Apr 11 16:00:01 lcg-lrz-dcache0 kernel: RPC:   rpc_clnt_test_xprt failed: -111 addr 127.0.0.1 not added

this time with mv.

Interestingly, while it hangs, I cannot ls /pnfs/lrz-muenchen.de/data/atlas/meta/ but I can ls other dirs.

Cheers, Chris.

calestyo avatar Apr 11 '22 14:04 calestyo

Removing the old file first, instead of overwriting it with mv, makes the problem go away.

Though at least once, I saw the issue also just on the first rm.

calestyo avatar Apr 11 '22 14:04 calestyo

Oh and even if it works then (because of the above workaround) I still get:

Apr 11 16:07:54 lcg-lrz-dcache0 kernel: NFS:  2001:4ca0:0:201:a:0:2:35: Session trunking failed for 129.187.131.53
Apr 11 16:07:54 lcg-lrz-dcache0 kernel: RPC:   rpc_clnt_test_xprt failed: -111 addr ::1 not added
Apr 11 16:07:54 lcg-lrz-dcache0 kernel: RPC:   rpc_clnt_test_xprt failed: -111 addr 127.0.0.1 not added

on every upload.

calestyo avatar Apr 11 '22 14:04 calestyo

Hi Chris,

dcache can't update/overwrite existing files. On an other hand, mv should not trigger truncate and should remove existing file. Do you observe the same bebavior when you run mv by hand?

kofemann avatar Apr 13 '22 10:04 kofemann

Hey.

dcache can't update/overwrite existing files

sure but I'd have expected that posix handles mv like a atomic delete/create?

And yes, it also hangs when I do it manually.

calestyo avatar Apr 13 '22 20:04 calestyo

Interestingly though:

/pnfs/lrz-muenchen.de/data/atlas/meta# cp storage-descriptor.json x
/pnfs/lrz-muenchen.de/data/atlas/meta# cp /bin/bash y
/pnfs/lrz-muenchen.de/data/atlas/meta# l
total 1,3M
drwxr-xr-x  5 root  root  512 Apr 13 22:38 .
drwxr-xr-x 12 10761 1307  512 Nov 16  2020 ..
-rw-r--r--  1 root  root 6,8k Apr 13 22:30 storage-descriptor.json
-rw-r--r--  1 root  root 6,8k Apr 13 22:38 x
-rwxr-xr-x  1 root  root 1,3M Apr 13 22:38 y
root@lcg-lrz-dcache0:/pnfs/lrz-muenchen.de/data/atlas/meta# mv y x
mv: overwrite 'x'? y

hangs

^C
/pnfs/lrz-muenchen.de/data/atlas/meta# l
total 1,3M
drwxr-xr-x  4 root  root  512 Apr 13 22:40 .
drwxr-xr-x 12 10761 1307  512 Nov 16  2020 ..
-rw-r--r--  1 root  root 6,8k Apr 13 22:30 storage-descriptor.json
-rwxr-xr-x  1 root  root 1,3M Apr 13 22:38 x

the file does seem to be overwritten:

# hd x | head
00000000  7f 45 4c 46 02 01 01 00  00 00 00 00 00 00 00 00  |.ELF............|
00000010  03 00 3e 00 01 00 00 00  80 06 03 00 00 00 00 00  |..>.............|
00000020  40 00 00 00 00 00 00 00  88 ce 12 00 00 00 00 00  |@...............|
00000030  00 00 00 00 40 00 38 00  0b 00 40 00 1d 00 1c 00  |[email protected]...@.....|
00000040  06 00 00 00 04 00 00 00  40 00 00 00 00 00 00 00  |........@.......|
00000050  40 00 00 00 00 00 00 00  40 00 00 00 00 00 00 00  |@.......@.......|
00000060  68 02 00 00 00 00 00 00  68 02 00 00 00 00 00 00  |h.......h.......|
00000070  08 00 00 00 00 00 00 00  03 00 00 00 04 00 00 00  |................|
00000080  a8 02 00 00 00 00 00 00  a8 02 00 00 00 00 00 00  |................|
00000090  a8 02 00 00 00 00 00 00  1c 00 00 00 00 00 00 00  |................|

calestyo avatar Apr 13 '22 20:04 calestyo