lockfile (flock) problems: left-over unlink/ files and false positive locks being the worst
Description of problem:
When using flock with a code sequence similar to (ignoring errors and timeouts):
fd = open("lockfile", O_RDWR | O_CREAT, 0600);
flock(fd, LOCK_EX);
// assume we have the lock (ie, st_ino from both stat("lockfile") and fstat(fd) matches).
...
unlink("lockfile");
close(fd);
various issues occur:
1. left-over files in .glusterfs/unlink/ (upon contention and failure of the secondary process to obtain the lock).
2. uninterruptible wait in flock() (which so far only "umount -fl /mnt/shared" could interrupt) upon ^C during flock().
3. ESTALE (stale file handle) errors on flock().
4. ENOENT (no such file or directory) errors on flock().
5. false lock success (stat("lockfile") and fstat(fd) supposedly reference the same file, but they don't).
Both 3 and 4 seem interesting, because unlinking a file shouldn't delete it; it should merely remove the link to the name, and the file should only be deleted once all names have been unlinked and no file descriptors still hold it open. 3 and 4 probably come from the same underlying bug, hit at different stages of the same race.
For 2: this happens when I press ^C while the flock() syscall is waiting for the lock. It is sporadic, and more reproducible with my C reproducer below than with the shell script.
1 is the simplest to reproduce, and the reason I started digging.
The exact command to reproduce the issue:
Create a replica 2 volume, mount it on /mnt/shared (for example), then execute the following shell commands (a sketch of how the volume can be created follows the volume info below):
morpheus [12:43:17] ~ # gluster volume info shared
Volume Name: shared
Type: Replicate
Volume ID: a4410662-b6e0-4ed0-b1e0-a1cbf168029c
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: morpheus:/mnt/gluster/shared
Brick2: r2d2:/mnt/gluster/shared
Options Reconfigured:
cluster.data-self-heal-algorithm: full
cluster.granular-entry-heal: enable
cluster.locking-scheme: granular
nfs.disable: on
transport.address-family: inet
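For completeness, a volume like this can be created roughly as follows (hostnames, brick paths and options taken from the volume info above; this is a sketch, not the exact command history, and gluster will warn about the split-brain risk of a plain replica 2 setup):

gluster volume create shared replica 2 morpheus:/mnt/gluster/shared r2d2:/mnt/gluster/shared
gluster volume set shared cluster.data-self-heal-algorithm full
gluster volume set shared cluster.granular-entry-heal enable
gluster volume set shared cluster.locking-scheme granular
gluster volume start shared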
Now mount and grab the lock on morpheus:
morpheus [12:43:27] ~ # mkdir /mnt/shared
morpheus [12:44:11] ~ # mount.glusterfs localhost:/shared /mnt/shared
morpheus [12:44:13] ~ # exec 3>/mnt/shared/lockfile
morpheus [12:44:17] ~ # flock -x 3
At this point, listing in /mnt/gluster/shared/.glusterfs/unlink/ on both bricks shows no files:
morpheus [12:44:20] ~ # ls -lah /mnt/gluster/shared/.glusterfs/unlink/
total 9.0K
drw------- 2 root root 4.0K Jan 17 12:40 .
drw------- 263 root root 5.0K Jun 2 2021 ..
morpheus [12:45:03] ~ #
r2d2 [12:43:21] ~ # ls -lah /mnt/gluster/shared/.glusterfs/unlink/
total 6.0K
drw------- 2 root root 1.0K Jan 17 12:37 .
drw------- 263 root root 5.0K Mar 16 2021 ..
r2d2 [12:45:07] ~ #
Now we do similar on r2d2:
r2d2 [12:45:07] ~ # mkdir /mnt/shared
r2d2 [12:45:56] ~ # mount.glusterfs localhost:/shared /mnt/shared
r2d2 [12:45:59] ~ # exec 3>/mnt/shared/lockfile
r2d2 [12:46:04] ~ # flock -w1 -x 3
r2d2 [12:46:10] ~ # echo $?
1
r2d2 [12:46:17] ~ # exec 3>&-
Again, nothing in /mnt/gluster/shared/.glusterfs/unlink/ on either server. Now for the kicker:
morpheus [12:47:49] ~ # rm /mnt/shared/lockfile
morpheus [12:47:56] ~ # exec 3>&-
morpheus [12:48:01] ~ # ls -lah /mnt/gluster/shared/.glusterfs/unlink/
total 10K
drw------- 2 root root 4.0K Jan 17 12:47 .
drw------- 263 root root 5.0K Jun 2 2021 ..
-rw-r--r-- 1 root root 0 Jan 17 12:46 a267a532-6aaf-4a14-8679-571e5c0f27f7
Which could be fine since r2d2 still has the file open; however:
It should be noted that upon execution of the rm the left-over file appears on BOTH bricks, and upon close it only gets removed from ONE of the two. Usually, not always; sometimes both get left behind (this seems to be a timing issue).
This is 100% reproducible in my testing. Killing the affected brick and running "gluster volume start shared force" fixes it. If the second mount doesn't even attempt an flock() call, all seems to be OK.
I suspect that because flock(2) doesn't support a timeout, the timeout is implemented by interrupting flock() with SIGALRM, and that somewhere along the way something ends up holding a reference on the opened file, so it never gets removed. A brick restart cleans it up because it works through the unlink/ files, notices that the file has already been removed from all bricks, and then unlinks them.
I'd expect this check to form part of the heal process somehow, or the file not to be created in the first place (however, since the file is still open it isn't really deleted yet ... and I'm not familiar enough with the internals to know whether this is a problem).
And upon close of the file descriptor that holds the deleted file open, the unlink/ entry should be removed completely (that is the root fix; the heal-time check is basically just cleanup in case of bugs).
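To check the SIGALRM part of that suspicion, one way (assuming the util-linux flock(1) and that strace is available) is to watch which timers and signals the waiting client sets up, e.g. while the other client holds the lock:

strace -f -e trace=flock,setitimer,rt_sigaction flock -w 1 -x /mnt/shared/lockfile -c true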
It seems this relates to rm'ing a file on which flock() was taken and which saw any contention at all. Even with flock -u 3 directly before the rm, the file still remains behind; see the sketch below. It seems that once there has been contention on the flock() call, the file WILL remain behind in .glusterfs/unlink/ no matter what.
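A minimal sketch of what I mean (same setup as the transcript above, with r2d2 doing its failing "flock -w1 -x 3" while morpheus holds the lock):

morpheus # exec 3>/mnt/shared/lockfile
morpheus # flock -x 3
# ... r2d2 attempts the lock and times out, as shown earlier ...
morpheus # flock -u 3
morpheus # rm /mnt/shared/lockfile
morpheus # exec 3>&-
morpheus # ls /mnt/gluster/shared/.glusterfs/unlink/   # the gfid file is still left behind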
Using fuser shows that glusterfsd is still holding open these files:
morpheus [13:38:36] /mnt/gluster/shared/.glusterfs/unlink # fuser -v *
USER PID ACCESS COMMAND
/mnt/gluster/shared/.glusterfs/unlink/b20a4f93-293f-4387-aa3b-d3ed42f618f6:
root 15591 F.... glusterfsd
/mnt/gluster/shared/.glusterfs/unlink/e92a17d5-70a9-4975-b62b-781449d26f5b:
root 15591 F.... glusterfsd
Over time (for example on other systems where we've got 5 nodes trying to grab a lock once a minute) this will either fill up the inodes on the brick filesystem (ext4 at least) or cause the glusterfsd process to hit EMFILE (unless there happens to be code in place to mitigate this).
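A crude way to watch this accumulating on a brick (using the glusterfsd pid from the fuser output above; adjust the pid and path as needed):

ls -l /proc/15591/fd | grep -c '\.glusterfs/unlink/'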
Other issues
The other three issues (2, 3 and 4) cropped up whilst I was still using a C loop to narrow down the above; the code is provided here since it may offer some additional insight. Please advise if these require separate bug reports:
#define _GNU_SOURCE
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <fcntl.h>
#include <sys/file.h>
#include <sys/stat.h>
#include <errno.h>

/* perror() with a printf-style prefix; preserves errno across the call. */
#define aperror(fmt, ...) do { \
    int t = errno; char *t2; \
    if (asprintf(&t2, fmt, ## __VA_ARGS__) < 0) { \
        fprintf(stderr, "Memory error trying to output error message\n"); \
    } else { \
        errno = t; perror(t2); free(t2); errno = t; \
    } \
} while(0)

int main(int argc, char** argv)
{
    int fd, sc = 0, tc = 0;
    time_t st = time(NULL) - 1; /* To avoid div 0. */
    const char* lockfile;
    struct stat st1, st2;

    if (argc != 2) {
        fprintf(stderr, "USAGE: %s lockfile\n", *argv);
        return 1;
    }
    lockfile = argv[1];
    printf("Lockfile: %s\n", lockfile);

    while (1) {
        fd = open(lockfile, O_RDWR | O_CREAT, 0600);
        if (fd < 0) {
            perror(lockfile);
            return 1;
        }
        ++tc;
        if (flock(fd, LOCK_EX) < 0) {
            aperror("flock on fd=%s", lockfile);
            close(fd);
            continue;
        }
        if (stat(lockfile, &st1) < 0) {
            aperror("stat(%s)", lockfile);
            if (errno != ENOENT)
                return 1;
        } else if (fstat(fd, &st2) < 0) {
            aperror("fstat on fd=%s", lockfile);
            return 1;
        } else if (st1.st_ino == st2.st_ino) {
            /* The name still refers to the inode we locked, so we own the lock. */
            sc++;
            // required to generate some contention.
            //usleep(100000 /* 100ms */);
            if (unlink(lockfile) < 0)
                aperror("unlink(%s)", lockfile);
        }
        close(fd);
        if (tc % 10 == 0) {
            time_t exp_t = time(NULL) - st;
            printf("\rProgress: %d/%d (%ld seconds, %.2f/s).",
                   sc, tc, (long)exp_t, (float)tc / exp_t);
            fflush(stdout);
        }
    }
    printf("\n");
}
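For reference, I compile and run this along these lines (one instance on each client, pointed at the same path on the fuse mount; locktest.c is just what I happen to call the file):

gcc -O2 -Wall -o locktest locktest.c
./locktest /mnt/shared/lockfile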
Issue 2 (uninterruptible wait) occurs if/when I ^C the two processes in too short succession (I suspect this depends on the exact timing of killing the first process, as it's not 100% reliable; it's always the second process to be ^Ced that locks up). strace all but eliminates the issue, however /proc/${pid_of_locked_up_process}/syscall shows it's stuck in flock().
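For the record, that check is simply the following (with $pid being the stuck process; the first field of the output is the syscall number, and 73 is flock on x86_64):

cat /proc/$pid/syscall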
The other issues occur during normal runs, and there is another race condition too that I just spotted:
flock on fd=/mnt/shared/lockfile: Stale file handle
flock on fd=/mnt/shared/lockfile: Stale file handle
flock on fd=/mnt/shared/lockfile: Stale file handle
flock on fd=/mnt/shared/lockfile: Stale file handle
flock on fd=/mnt/shared/lockfile: Stale file handle
unlink(/mnt/shared/lockfile): No such file or directory
flock on fd=/mnt/shared/lockfile: No such file or directory
flock on fd=/mnt/shared/lockfile: No such file or directory
flock on fd=/mnt/shared/lockfile: No such file or directory
flock on fd=/mnt/shared/lockfile: No such file or directory
flock on fd=/mnt/shared/lockfile: No such file or directory
Based on the code, the unlink() can ONLY ever happen if we hold the lock, which implies that somewhere along the line stat("lockfile") and fstat(fd) return the same st_ino values even though the filename no longer references the file that was opened. This means that processes can believe they hold the lock while they actually don't. Perhaps someone can point out a bug in my C code, but it works just fine on an ext4 filesystem (I just cycled 4 processes for ~1000000 iterations each before giving up), with one variation:
On ext4 the code also outputs the perror() for stat(); that is the only failure that does happen there, producing lines like "stat(/tmp/lockfile): No such file or directory". This was never output on glusterfs (which, thinking about it, is very odd). Even introducing the usleep(100 or 100000) above doesn't trigger any issues on ext4; it just lowers the number of iterations/second.
The full output of the command that failed:
See above.
Expected results:
I expect there to be no left-over files in the .glusterfs/unlink/ folders.
I expect not to see ENOENT or ESTALE errors (if possible), although these can be worked around, and most implementations that deal with flock() will likely already do so.
Mandatory info:
- The output of the gluster volume info command:
morpheus [11:14:29] ~ # gluster volume info shared
Volume Name: shared
Type: Replicate
Volume ID: a4410662-b6e0-4ed0-b1e0-a1cbf168029c
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: morpheus:/mnt/gluster/shared
Brick2: r2d2:/mnt/gluster/shared
Options Reconfigured:
cluster.data-self-heal-algorithm: full
cluster.granular-entry-heal: enable
cluster.locking-scheme: granular
nfs.disable: on
transport.address-family: inet
- The output of the gluster volume status command:
morpheus [11:14:31] ~ # gluster volume status shared
Status of volume: shared
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick morpheus:/mnt/gluster/shared          54561     0          Y       27205
Brick r2d2:/mnt/gluster/shared              50032     0          Y       28040
Self-heal Daemon on localhost N/A N/A Y 20910
Self-heal Daemon on r2d2 N/A N/A Y 28143
Task Status of Volume shared
------------------------------------------------------------------------------
There are no active volume tasks
- The output of the gluster volume heal command:
morpheus [11:14:48] ~ # gluster volume heal shared
Launching heal operation to perform index self heal on volume shared has been successful
Use heal info commands to check status.
morpheus [11:15:07] ~ # gluster volume heal shared info
Brick morpheus:/mnt/gluster/shared
Status: Connected
Number of entries: 0
Brick r2d2:/mnt/gluster/shared
Status: Connected
Number of entries: 0
**- Provide logs present on following locations of client and server nodes - /var/log/glusterfs/
There is currently ~700MB worth of logs here, almost all of it from my testing over the last few hours. Going through the cycle again with tail -F /var/log/glusterfs/mnt-shared.log on both servers, I believe these are the relevant logs (I've separated the log entries that happen when executing flock with blank lines):
morpheus:
[2022-01-17 11:06:47.164461 +0000] I [glusterfsd.c:2447:daemonize] 0-glusterfs: Pid of current running process is 19118
[2022-01-17 11:06:47.166746 +0000] E [MSGID: 101242] [gf-io-uring.c:410:gf_io_uring_setup] 0-io: io_uring_setup() failed. [Cannot allocate memory]
[2022-01-17 11:06:47.173241 +0000] I [MSGID: 101190] [event-epoll.c:667:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=1}]
[2022-01-17 11:06:47.173333 +0000] I [MSGID: 101190] [event-epoll.c:667:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=0}]
[2022-01-17 11:06:47.205074 +0000] I [glusterfsd-mgmt.c:2203:mgmt_getspec_cbk] 0-glusterfs: Received list of available volfile servers: r2d2:24007
[2022-01-17 11:06:47.207611 +0000] I [io-stats.c:3701:ios_sample_buf_size_configure] 0-shared: Configure ios_sample_buf size is 1024 because ios_sample_interval is 0
[2022-01-17 11:06:47.210021 +0000] I [MSGID: 114020] [client.c:2336:notify] 0-shared-client-0: parent translators are ready, attempting connect on transport []
[2022-01-17 11:06:47.210911 +0000] I [MSGID: 114020] [client.c:2336:notify] 0-shared-client-1: parent translators are ready, attempting connect on transport []
[2022-01-17 11:06:47.211260 +0000] I [rpc-clnt.c:1969:rpc_clnt_reconfig] 0-shared-client-0: changing port to 53539 (from 0)
[2022-01-17 11:06:47.211312 +0000] I [socket.c:834:__socket_shutdown] 0-shared-client-0: intentional socket shutdown(12)
Final graph:
+------------------------------------------------------------------------------+
1: volume shared-client-0
2: type protocol/client
3: option ping-timeout 42
4: option remote-host morpheus
5: option remote-subvolume /mnt/gluster/shared
6: option transport-type socket
7: option transport.address-family inet
8: option username 4285afa4-3741-4c89-9e55-caec55a9207c
9: option password fafded4a-64c4-40be-96f1-51d8a3cb0180
10: option transport.socket.ssl-enabled off
11: option transport.tcp-user-timeout 0
12: option transport.socket.keepalive-time 20
13: option transport.socket.keepalive-interval 2
14: option transport.socket.keepalive-count 9
15: option strict-locks off
16: option send-gids true
17: end-volume
18:
19: volume shared-client-1
20: type protocol/client
21: option ping-timeout 42
22: option remote-host r2d2
23: option remote-subvolume /mnt/gluster/shared
24: option transport-type socket
25: option transport.address-family inet
26: option username 4285afa4-3741-4c89-9e55-caec55a9207c
27: option password fafded4a-64c4-40be-96f1-51d8a3cb0180
28: option transport.socket.ssl-enabled off
29: option transport.tcp-user-timeout 0
30: option transport.socket.keepalive-time 20
31: option transport.socket.keepalive-interval 2
32: option transport.socket.keepalive-count 9
33: option strict-locks off
34: option send-gids true
35: end-volume
36:
37: volume shared-replicate-0
38: type cluster/replicate
39: option afr-pending-xattr shared-client-0,shared-client-1
40: option volume-id a4410662-b6e0-4ed0-b1e0-a1cbf168029c
41: option data-self-heal-algorithm full
42: option locking-scheme granular
43: option granular-entry-heal enable
44: option use-compound-fops off
45: option use-anonymous-inode yes
46: subvolumes shared-client-0 shared-client-1
47: end-volume
48:
49: volume shared-dht
50: type cluster/distribute
51: option lock-migration off
52: option force-migration off
53: subvolumes shared-replicate-0
54: end-volume
55:
56: volume shared-utime
57: type features/utime
58: option noatime on
59: subvolumes shared-dht
60: end-volume
61:
62: volume shared-write-behind
63: type performance/write-behind
64: subvolumes shared-utime
65: end-volume
66:
67: volume shared-open-behind
68: type performance/open-behind
69: subvolumes shared-write-behind
70: end-volume
71:
[2022-01-17 11:06:47.212590 +0000] I [MSGID: 114057] [client-handshake.c:1118:select_server_supported_programs] 0-shared-client-0: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
72: volume shared-quick-read
73: type performance/quick-read
74: subvolumes shared-open-behind
75: end-volume
76:
77: volume shared-md-cache
78: type performance/md-cache
79: subvolumes shared-quick-read
80: end-volume
81:
82: volume shared
83: type debug/io-stats
84: option log-level INFO
85: option threads 16
86: option latency-measurement off
87: option count-fop-hits off
88: option global-threading off
89: subvolumes shared-md-cache
90: end-volume
91:
92: volume meta-autoload
93: type meta
94: subvolumes shared
95: end-volume
96:
+------------------------------------------------------------------------------+
[2022-01-17 11:06:47.213187 +0000] I [rpc-clnt.c:1969:rpc_clnt_reconfig] 0-shared-client-1: changing port to 50032 (from 0)
[2022-01-17 11:06:47.213301 +0000] I [socket.c:834:__socket_shutdown] 0-shared-client-1: intentional socket shutdown(13)
[2022-01-17 11:06:47.213738 +0000] I [MSGID: 114046] [client-handshake.c:856:client_setvolume_cbk] 0-shared-client-0: Connected, attached to remote volume [{conn-name=shared-client-0}, {remote_subvol=/mnt/gluster/shared}]
[2022-01-17 11:06:47.213783 +0000] I [MSGID: 108005] [afr-common.c:6045:__afr_handle_child_up_event] 0-shared-replicate-0: Subvolume 'shared-client-0' came back up; going online.
[2022-01-17 11:06:47.215395 +0000] I [MSGID: 114057] [client-handshake.c:1118:select_server_supported_programs] 0-shared-client-1: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
[2022-01-17 11:06:47.232645 +0000] I [MSGID: 114046] [client-handshake.c:856:client_setvolume_cbk] 0-shared-client-1: Connected, attached to remote volume [{conn-name=shared-client-1}, {remote_subvol=/mnt/gluster/shared}]
[2022-01-17 11:06:47.234447 +0000] I [fuse-bridge.c:5294:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.31
[2022-01-17 11:06:47.234513 +0000] I [fuse-bridge.c:5926:fuse_graph_sync] 0-fuse: switched to graph 0
[2022-01-17 11:06:47.235889 +0000] I [MSGID: 108031] [afr-common.c:3201:afr_local_discovery_cbk] 0-shared-replicate-0: selecting local read_child shared-client-0
[2022-01-17 11:07:04.876506 +0000] W [dict.c:1584:dict_get_with_ref] (-->/usr/lib64/glusterfs/10.0/xlator/protocol/client.so(+0x541f6) [0x7f9f43fd21f6] -->/usr/lib64/libglusterfs.so.0(dict_get_int32+0x39) [0x7f9f4db00a19] -->/usr/lib64/libglusterfs.so.0(dict_get_with_ref+0xdb) [0x7f9f4db003fb] ) 0-dict: dict OR key (fd-reopen-status) is NULL [Invalid argument]
[2022-01-17 11:07:04.877366 +0000] W [dict.c:1584:dict_get_with_ref] (-->/usr/lib64/glusterfs/10.0/xlator/protocol/client.so(+0x541f6) [0x7f9f43fd21f6] -->/usr/lib64/libglusterfs.so.0(dict_get_int32+0x39) [0x7f9f4db00a19] -->/usr/lib64/libglusterfs.so.0(dict_get_with_ref+0xdb) [0x7f9f4db003fb] ) 0-dict: dict OR key (fd-reopen-status) is NULL [Invalid argument]
[2022-01-17 11:07:45.021864 +0000] I [fuse-bridge.c:6303:fuse_thread_proc] 0-fuse: initiating unmount of /mnt/shared
[2022-01-17 11:07:45.021077 +0000] I [MSGID: 108031] [afr-common.c:3201:afr_local_discovery_cbk] 0-shared-replicate-0: selecting local read_child shared-client-0
[2022-01-17 11:07:45.022187 +0000] W [glusterfsd.c:1458:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7d0e) [0x7f9f4da46d0e] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xc5) [0x55a98ac66af5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x54) [0x55a98ac66974] ) 0-: received signum (15), shutting down
[2022-01-17 11:07:45.022221 +0000] I [fuse-bridge.c:7057:fini] 0-fuse: Unmounting '/mnt/shared'.
[2022-01-17 11:07:45.022237 +0000] I [fuse-bridge.c:7061:fini] 0-fuse: Closing fuse connection to '/mnt/shared'.
r2d2:
[2022-01-17 11:06:52.332429 +0000] I [MSGID: 100030] [glusterfsd.c:2767:main] 0-/usr/sbin/glusterfs: Started running version [{arg=/usr/sbin/glusterfs}, {version=10.0}, {cmdlinestr=/usr/sbin/glusterfs --process-name fuse --volfile-server=localhost --volfile-id=/shared /mnt/shared}]
[2022-01-17 11:06:52.413323 +0000] I [glusterfsd.c:2447:daemonize] 0-glusterfs: Pid of current running process is 7315
[2022-01-17 11:06:52.414947 +0000] E [MSGID: 101242] [gf-io-uring.c:410:gf_io_uring_setup] 0-io: io_uring_setup() failed. [Cannot allocate memory]
[2022-01-17 11:06:52.463889 +0000] I [MSGID: 101190] [event-epoll.c:667:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=0}]
[2022-01-17 11:06:52.463945 +0000] I [MSGID: 101190] [event-epoll.c:667:event_dispatch_epoll_worker] 0-epoll: Started thread with index [{index=1}]
[2022-01-17 11:06:52.687885 +0000] I [glusterfsd-mgmt.c:2203:mgmt_getspec_cbk] 0-glusterfs: Received list of available volfile servers: morpheus:24007
[2022-01-17 11:06:52.968275 +0000] I [io-stats.c:3701:ios_sample_buf_size_configure] 0-shared: Configure ios_sample_buf size is 1024 because ios_sample_interval is 0
[2022-01-17 11:06:52.970534 +0000] I [MSGID: 114020] [client.c:2336:notify] 0-shared-client-0: parent translators are ready, attempting connect on transport []
[2022-01-17 11:06:52.976608 +0000] I [MSGID: 114020] [client.c:2336:notify] 0-shared-client-1: parent translators are ready, attempting connect on transport []
Final graph:
+------------------------------------------------------------------------------+
1: volume shared-client-0
2: type protocol/client
3: option ping-timeout 42
4: option remote-host morpheus
5: option remote-subvolume /mnt/gluster/shared
6: option transport-type socket
7: option transport.address-family inet
8: option username 4285afa4-3741-4c89-9e55-caec55a9207c
9: option password fafded4a-64c4-40be-96f1-51d8a3cb0180
10: option transport.socket.ssl-enabled off
11: option transport.tcp-user-timeout 0
12: option transport.socket.keepalive-time 20
13: option transport.socket.keepalive-interval 2
14: option transport.socket.keepalive-count 9
15: option strict-locks off
16: option send-gids true
17: end-volume
18:
19: volume shared-client-1
20: type protocol/client
21: option ping-timeout 42
22: option remote-host r2d2
23: option remote-subvolume /mnt/gluster/shared
24: option transport-type socket
25: option transport.address-family inet
26: option username 4285afa4-3741-4c89-9e55-caec55a9207c
27: option password fafded4a-64c4-40be-96f1-51d8a3cb0180
28: option transport.socket.ssl-enabled off
29: option transport.tcp-user-timeout 0
30: option transport.socket.keepalive-time 20
31: option transport.socket.keepalive-interval 2
32: option transport.socket.keepalive-count 9
33: option strict-locks off
34: option send-gids true
35: end-volume
36:
37: volume shared-replicate-0
38: type cluster/replicate
39: option afr-pending-xattr shared-client-0,shared-client-1
40: option volume-id a4410662-b6e0-4ed0-b1e0-a1cbf168029c
41: option data-self-heal-algorithm full
42: option locking-scheme granular
43: option granular-entry-heal enable
[2022-01-17 11:06:52.977434 +0000] I [rpc-clnt.c:1969:rpc_clnt_reconfig] 0-shared-client-1: changing port to 50032 (from 0)
44: option use-compound-fops off
45: option use-anonymous-inode yes
46: subvolumes shared-client-0 shared-client-1
47: end-volume
48:
49: volume shared-dht
50: type cluster/distribute
51: option lock-migration off
52: option force-migration off
53: subvolumes shared-replicate-0
54: end-volume
55:
56: volume shared-utime
57: type features/utime
[2022-01-17 11:06:52.977548 +0000] I [socket.c:834:__socket_shutdown] 0-shared-client-1: intentional socket shutdown(13)
[2022-01-17 11:06:52.977601 +0000] I [rpc-clnt.c:1969:rpc_clnt_reconfig] 0-shared-client-0: changing port to 53539 (from 0)
[2022-01-17 11:06:52.977634 +0000] I [socket.c:834:__socket_shutdown] 0-shared-client-0: intentional socket shutdown(12)
58: option noatime on
59: subvolumes shared-dht
60: end-volume
61:
62: volume shared-write-behind
63: type performance/write-behind
64: subvolumes shared-utime
65: end-volume
66:
67: volume shared-open-behind
68: type performance/open-behind
69: subvolumes shared-write-behind
70: end-volume
71:
72: volume shared-quick-read
73: type performance/quick-read
74: subvolumes shared-open-behind
75: end-volume
76:
77: volume shared-md-cache
78: type performance/md-cache
79: subvolumes shared-quick-read
80: end-volume
81:
82: volume shared
83: type debug/io-stats
84: option log-level INFO
85: option threads 16
86: option latency-measurement off
87: option count-fop-hits off
88: option global-threading off
89: subvolumes shared-md-cache
90: end-volume
91:
92: volume meta-autoload
93: type meta
94: subvolumes shared
95: end-volume
96:
+------------------------------------------------------------------------------+
[2022-01-17 11:06:52.978404 +0000] I [MSGID: 114057] [client-handshake.c:1118:select_server_supported_programs] 0-shared-client-1: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
[2022-01-17 11:06:52.978881 +0000] I [MSGID: 114057] [client-handshake.c:1118:select_server_supported_programs] 0-shared-client-0: Using Program [{Program-name=GlusterFS 4.x v1}, {Num=1298437}, {Version=400}]
[2022-01-17 11:06:52.979133 +0000] I [MSGID: 114046] [client-handshake.c:856:client_setvolume_cbk] 0-shared-client-1: Connected, attached to remote volume [{conn-name=shared-client-1}, {remote_subvol=/mnt/gluster/shared}]
[2022-01-17 11:06:52.979161 +0000] I [MSGID: 108005] [afr-common.c:6045:__afr_handle_child_up_event] 0-shared-replicate-0: Subvolume 'shared-client-1' came back up; going online.
[2022-01-17 11:06:52.980247 +0000] I [MSGID: 114046] [client-handshake.c:856:client_setvolume_cbk] 0-shared-client-0: Connected, attached to remote volume [{conn-name=shared-client-0}, {remote_subvol=/mnt/gluster/shared}]
[2022-01-17 11:06:52.982004 +0000] I [fuse-bridge.c:5294:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.31
[2022-01-17 11:06:52.982042 +0000] I [fuse-bridge.c:5926:fuse_graph_sync] 0-fuse: switched to graph 0
[2022-01-17 11:06:52.983099 +0000] I [MSGID: 108031] [afr-common.c:3201:afr_local_discovery_cbk] 0-shared-replicate-0: selecting local read_child shared-client-1
[2022-01-17 11:07:26.107896 +0000] W [dict.c:1584:dict_get_with_ref] (-->/usr/lib64/glusterfs/10.0/xlator/protocol/client.so(+0x53f46) [0x7f0eae5fff46] -->/usr/lib64/libglusterfs.so.0(dict_get_int32+0x39) [0x7f0eb40b6a09] -->/usr/lib64/libglusterfs.so.0(dict_get_with_ref+0xdb) [0x7f0eb40b63eb] ) 0-dict: dict OR key (fd-reopen-status) is NULL [Invalid argument]
[2022-01-17 11:07:27.107641 +0000] W [MSGID: 114031] [client-rpc-fops_v2.c:2214:client4_0_lk_cbk] 0-shared-client-0: remote operation failed. [{errno=4}, {error=Interrupted system call}]
[2022-01-17 11:07:27.107778 +0000] W [dict.c:1584:dict_get_with_ref] (-->/usr/lib64/glusterfs/10.0/xlator/protocol/client.so(+0x53f46) [0x7f0eae5fff46] -->/usr/lib64/libglusterfs.so.0(dict_get_int32+0x39) [0x7f0eb40b6a09] -->/usr/lib64/libglusterfs.so.0(dict_get_with_ref+0xdb) [0x7f0eb40b63eb] ) 0-dict: dict OR key (fd-reopen-status) is NULL [Invalid argument]
[2022-01-17 11:07:48.509449 +0000] I [MSGID: 108031] [afr-common.c:3201:afr_local_discovery_cbk] 0-shared-replicate-0: selecting local read_child shared-client-1
[2022-01-17 11:07:48.510595 +0000] I [fuse-bridge.c:6303:fuse_thread_proc] 0-fuse: initiating unmount of /mnt/shared
[2022-01-17 11:07:48.510739 +0000] W [glusterfsd.c:1458:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7cde) [0x7f0eb3ffccde] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xc5) [0x561885fa6af5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x54) [0x561885fa6974] ) 0-: received signum (15), shutting down
[2022-01-17 11:07:48.510785 +0000] I [fuse-bridge.c:7057:fini] 0-fuse: Unmounting '/mnt/shared'.
[2022-01-17 11:07:48.510800 +0000] I [fuse-bridge.c:7061:fini] 0-fuse: Closing fuse connection to '/mnt/shared'.
**- Is there any crash ? Provide the backtrace and coredump
No crash.
Additional info:
I do believe the above is very comprehensive.
- The operating system / glusterfs version:
Gentoo. Glusterfs 10.0. If there are specific package versions outside of glusterfs you're interested in, please ask. Happy to assist with testing patches, or even with code if I can be pointed in the right direction.
Thank you for your contributions. We noticed that this issue has not had any activity in the last ~6 months, so we are marking it as stale. It will be closed in 2 weeks if no one responds with a comment here.
Closing this issue as there has been no update since my last comment. If this issue is still valid, feel free to reopen it.