
fuse concurrency problems

Open · jkroonza opened this issue 2 years ago

Description of problem:

A bunch of crashes (SIGABRT + SIGSEGV) that only seem to happen when concurrency is in place (by way of --global-threads, client.event-threads or --reader-thread-count).

Previous issues that I believe are related, both filed under older versions of glusterfs:

#3379 #3389

Backtrace information seems to have changed after moving from 10.1 to 10.2.

The exact command to reproduce the issue:

Run glusterfs under heavy load. Unfortunately we're not able to pinpoint an exact pattern.

The full output of the command that failed:

No specific command, but the fuse mount becomes inaccessible.

Mount command:

/usr/sbin/glusterfs --lru-limit=1048576 --invalidate-limit=64 --background-qlen=32 --fuse-mountopts=noatime,nosuid,noexec,nodev --process-name fuse --volfile-server=localhost --volfile-id=mail --fuse-mountopts=noatime,nosuid,noexec,nodev /var/spool/mail

Even a 512k lru-limit seems to cause pressure (the default is 64k). invalidate-limit > 64 causes performance bottlenecks, as does background-qlen > 32 (the default is 64 if I recall correctly).

Expected results:

The fuse process shouldn't crash. Ever.

Mandatory info: - The output of the gluster volume info command:

Volume Name: mail
Type: Replicate
Volume ID: 2938a063-f53d-4a1c-a84f-c1406bdc260d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: server-a:/mnt/gluster/mail
Brick2: server-b:/mnt/gluster/mail
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
storage.fips-mode-rchecksum: on
cluster.granular-entry-heal: enable
cluster.readdir-optimize: on
performance.least-prio-threads: 8
performance.low-prio-threads: 8
performance.normal-prio-threads: 16
performance.high-prio-threads: 32
performance.io-thread-count: 64
cluster.data-self-heal-algorithm: full
server.event-threads: 2
config.client-threads: 2
client.event-threads: 4
config.brick-threads: 32
config.global-threading: on
performance.iot-pass-through: true

- The output of the gluster volume status command:

Status of volume: mail
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick server-a:/mnt/gluster/mail            52746     0          Y       17113
Brick server-b:/mnt/gluster/mail            56038     0          Y       19605
Self-heal Daemon on localhost               N/A       N/A        Y       22011
Self-heal Daemon on uriel.interexcel.co.za  N/A       N/A        Y       10033
 
Task Status of Volume mail
------------------------------------------------------------------------------
There are no active volume tasks

- The output of the gluster volume heal command:

Brick server-a:/mnt/gluster/mail
Status: Connected
Number of entries: 0

Brick server-b:/mnt/gluster/mail
Status: Connected
Number of entries: 0

- Provide logs present on following locations of client and server nodes - /var/log/glusterfs/

- Is there any crash? Provide the backtrace and coredump

Every single backtrace that we've got on record for glusterfs 10.2:

stack-20220709-002211-core-glfs_epoll003.21995.1657318931.txt stack-20220707-164109-core-glfs_fuseproc.14339.1657204869.txt stack-20220707-123136-core-glfs_epoll001.24132.1657189896.txt stack-20220719-162257-core-glfs_fuseproc.5060.1658240577.txt stack-20220718-090714-core-glfs_epoll000.11714.1658128034.txt stack-20220716-052816-core-glfs_fuseproc.25810.1657942096.txt stack-20220714-092116-core-glfs_fuseproc.28779.1657783276.txt stack-20220708-130947-core-glfs_fuseproc.2391.1657278587.txt stack-20220708-085849-core-glfs_fuseproc.7528.1657263529.txt stack-20220719-212409-core-glfs_epoll002.7948.1658258649.txt stack-20220717-074401-core-glfs_epoll001.14841.1658036641.txt stack-20220715-155855-core-glfs_fuseproc.27261.1657893535.txt stack-20220711-061253-core-glfs_fuseproc.30191.1657512773.txt stack-20220707-075353-core-glfs_fuseproc.1317.1657173233.txt stack-20220720-095715-core-glfs_fuseproc.7011.1658303835.txt stack-20220717-032518-core-glfs_sigwait.30448.1658021118.txt stack-20220714-094633-core-glfs_fuseproc.29634.1657784793.txt stack-20220713-145555-core-glfs_fuseproc.30223.1657716955.txt stack-20220707-135826-core-glfs_fuseproc.24479.1657195106.txt stack-20220707-094846-core-glfs_fuseproc.25850.1657180126.txt stack-20220719-201713-core-glfs_fuseproc.7911.1658254633.txt stack-20220718-155245-core-glfs_fuseproc.12801.1658152365.txt stack-20220717-010236-core-glfs_epoll000.7659.1658012556.txt stack-20220714-161134-core-glfs_epoll001.15421.1657807894.txt stack-20220712-135657-core-glfs_fuseproc.18857.1657627017.txt stack-20220712-105913-core-glfs_fuseproc.23844.1657616353.txt stack-20220708-215004-core-glfs_fuseproc.31391.1657309804.txt stack-20220708-003514-core-glfs_epoll003.16031.1657233314.txt stack-20220707-205027-core-glfs_fuseproc.9769.1657219827.txt stack-20220707-143432-core-glfs_epoll000.16146.1657197272.txt

Additional info:

- The operating system / glusterfs version:

Gentoo Linux, kernel 5.8.14, glusterfs 10.2.

jkroonza avatar Jul 20 '22 14:07 jkroonza

We just had a similar crash in one of the brick processes, so I don't think this is isolated to fuse; it's just more frequent on the fuse side. Unfortunately the stack traces for the brick process are corrupt, but we can confirm SIGABRT.

Will see if we can sort out the backtrace situation.

jkroonza avatar Jul 22 '22 10:07 jkroonza

Do you happen to know if it happens with memory pools disabled?

mykaul avatar Aug 17 '22 07:08 mykaul

@mykaul I believe memory pools are disabled, from ./configure:

GlusterFS configure summary
===========================
FUSE client          : yes
epoll IO multiplex   : yes
fusermount           : no
readline             : yes
georeplication       : yes
Linux-AIO            : yes
Linux io_uring       : yes
Use liburing         : yes
Enable Debug         : no
Run with Valgrind    : no
Sanitizer enabled    : none
XML output           : yes
Unit Tests           : no
Track priv ports     : yes
POSIX ACLs           : yes
SELinux features     : yes
firewalld-config     : no
Events               : yes
EC dynamic support   : x64 sse avx
Use memory pools     : no   <---
Nanosecond m/atimes  : yes
Server components    : yes
Legacy gNFS server   : no
IPV6 default         : no
Use TIRPC            : yes
With Python          : 3.10
Cloudsync            : yes
Metadata dispersal   : no
Link with TCMALLOC   : yes
Enable Brick Mux     : no
Building with LTO    : no

jkroonza avatar Aug 17 '22 10:08 jkroonza

At least some (I did not look at all of them) of the stack traces had the pool sweeper in them, so I thought memory pools were enabled there.

mykaul avatar Aug 17 '22 11:08 mykaul

Perhaps I'm wrong; is there some definitive way to confirm that it's disabled?

The PR you pointed out to me last night (#3226) looks like it could fix a couple of locking issues in inode.c, which imho could be related here. I'm busy applying that patch series on top of 10.2 in order to do some basic testing on my test client first, and if that checks out I'll deploy it to one of the production nodes.

jkroonza avatar Aug 17 '22 11:08 jkroonza

Would it be possible for you to try mounting the volume without global-threading? Please try io-threads instead of global-threading for the client. You can enable io-threads only for the client and avoid passing the global-threading argument to the client.

mohit84 avatar Aug 18 '22 05:08 mohit84

@mohit84 we're not using global threads on the clients, only the bricks:

Options Reconfigured:
cluster.locking-scheme: granular
performance.open-behind: off  <-- causes other issues.
performance.iot-pass-through: true  <-- IOT off.
config.global-threading: on <-- this only applies to the bricks, not the clients.
config.brick-threads: 32  <-- global thread count for bricks.
client.event-threads: 4  <-- epoll() threads for the client, this we should lower to 2 for starters, possibly 1.
config.client-threads: 2 <-- global thread count for clients when global threading is enabled with the CLI option on the fuse mount.
server.event-threads: 2 <-- server only needs two here, so clients really don't need four.
cluster.data-self-heal-algorithm: full  <-- because it's faster for us to copy than to diff.
performance.io-thread-count: 64
performance.high-prio-threads: 32
performance.normal-prio-threads: 16
performance.low-prio-threads: 8
performance.least-prio-threads: 8  <-- all of these performance settings relate to IOT.
cluster.readdir-optimize: on
cluster.granular-entry-heal: enable
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off <-- iot on client off.

I've proceeded to lower client.event-threads to 2 for now. This may already help reduce the frequency, as I believe this is caused by a race condition resulting in memory corruption in some cases, usually resulting in crashes (but sometimes processes that just sit and consume resources without making any progress).

Fuse process itself started as:

/usr/sbin/glusterfs --lru-limit=1048576 --invalidate-limit=64 --background-qlen=32 --fuse-mountopts=noatime,nosuid,noexec,nodev --process-name fuse --volfile-server=localhost --volfile-id=mail --fuse-mountopts=noatime,nosuid,noexec,nodev /var/spool/mail

Note: no --global-threading.

The threads for the fuse mount now (as listed by my gluster_list_threads tool):

fuse:localhost:mail-/var/spool/mail (pid=13068): 28 threads
  - glusterfs (1): S x 1.
  - glfs_timer (1): S x 1.
  - glfs_sigwait (1): S x 1.
  - glfs_worker (17): S x 17.
  - glfs_memsweep (1): S x 1.
  - glfs_sproc (2): S x 2.
  - glfs_epoll (2): S x 1 R x 1.
  - glfs_fuseproc (1): S x 1.
  - glfs_fusedlyd (1): S x 1.
  - glfs_fusenoti (1): S x 1.

No iot threads that I can see; to this day I've not managed to figure out where the glfs_worker threads come from, nor do they seem to do any work, ever.

I'm assuming @mykaul was referring here to the glfs_memsweep thread above; if that's a memory pool thing and memory pools are disabled, I'm not sure why it's there at all.

jkroonza avatar Aug 18 '22 07:08 jkroonza

Actually, I missed reading the stack backtrace; the thread (gf_io_thread_main) is spawned when io_uring is enabled, not global-threading. Can you please try after disabling io_uring?

mohit84 avatar Aug 18 '22 07:08 mohit84

Perhaps the client was compiled and deployed separately from the server (which could explain why it has memory pools, or at least the sweeper, enabled)?

mykaul avatar Aug 18 '22 09:08 mykaul

Actually, I missed reading the stack backtrace; the thread (gf_io_thread_main) is spawned when io_uring is enabled, not global-threading. Can you please try after disabling io_uring?

Are you referring to compile-time options or the runtime options?

uriel [17:13:41] ~ # gluster volume get mail all | grep uring
storage.linux-io_uring                   off (DEFAULT)

Happy to switch two nodes to be compiled with --disable-linux_io_uring if that is what you mean?

jkroonza avatar Aug 18 '22 15:08 jkroonza

Perhaps the client was compiled and deployed separately from the server (which could explain why it has memory pools, or at least the sweeper, enabled)?

No, all clients/servers are compiled with exactly the same options. io_uring, however, is enabled at compile time as per the above, and I have noticed that even with storage.linux-io_uring off some functions still use it.

jkroonza avatar Aug 18 '22 15:08 jkroonza

Actually, I missed reading the stack backtrace; the thread (gf_io_thread_main) is spawned when io_uring is enabled, not global-threading. Can you please try after disabling io_uring?

Are you referring to compile-time options or the runtime options?

uriel [17:13:41] ~ # gluster volume get mail all | grep uring
storage.linux-io_uring                   off (DEFAULT)

Happy to switch two nodes to be compiled with --disable-linux_io_uring if that is what you mean?

You need to disable it during compilation. Yes, you need to compile with --disable-linux_io_uring.

mohit84 avatar Aug 19 '22 04:08 mohit84

@mohit84 two of our nodes have been recompiled with this. Offhand we can say that the load average for these two nodes has gone up significantly (nearly doubled); we will, however, have to wait and see whether the crash problem has sorted itself out. I'll feed back if one of these two nodes goes down again.

jkroonza avatar Aug 19 '22 08:08 jkroonza

Report back, since post above:

node 1 (with iouring): 2 crashes
node 2 (with iouring): no crashes
node 3 (without iouring): 2 crashes
node 4 (with iouring): no crashes

So there is no definitive answer on whether or not iouring makes a difference, in my opinion. Perhaps the two non-iouring crashes contain additional information, so attaching:

stack-20220821-152638-core-glfs_epoll000.31304.1661088398.txt stack-20220822-101033-core-glfs_fuseproc.16243.1661155833.txt

These two have no iouring involved, which may simplify the analysis.

jkroonza avatar Aug 23 '22 19:08 jkroonza

Following up again, since making the switch:

node 1 (with iouring): 4 crashes
node 2 (with iouring): 4 crashes
node 3 (without iouring): 2 crashes (same two from previous post)
node 4 (without iouring): no crashes (was incorrectly labeled as with iouring above).

So I'm wondering if perhaps I missed killing two hung processes on node 3 that just crashed much later.

Updating all nodes to no uring now.

jkroonza avatar Aug 27 '22 09:08 jkroonza

Whilst things may be less frequent, we're definitely still seeing problems on this front.

jkroonza avatar Aug 29 '22 11:08 jkroonza

Whilst things may be less frequent, we're definitely still seeing problems on this front.

I don't think it has to do with iouring; it looks like memory corruption to me (which of course is harder to pinpoint), from looking at some of the traces.

mykaul avatar Aug 29 '22 13:08 mykaul

Can you disable open-behind? Just a wild guess, based on seeing it on multiple traces - nothing specific.

mykaul avatar Aug 29 '22 14:08 mykaul

Can you disable open-behind? Just a wild guess, based on seeing it on multiple traces - nothing specific.

Unless there are multiple settings it already is:

# gluster volume info
...
performance.open-behind: off

However, I just noticed that, as usual, there seem to be multiple settings that need to be set in conjunction to achieve a single purpose:

bagheera [16:32:13] ~ # gluster volume get mail all | grep open-behind
performance.open-behind-pass-through     false (DEFAULT)                        
performance.open-behind                  off                                    

I've now enabled performance.open-behind-pass-through:

bagheera [16:34:05] ~ # gluster volume get mail all | grep open-behind
performance.open-behind-pass-through     true                                   
performance.open-behind                  off                                    

Not seeing an auto-reload in the logs ... and it takes me several hours to drain a single server by removing it from the load balancers, so I'll do so this evening when it's quieter and I can do a server in ten minutes or so by forcing things out.

jkroonza avatar Aug 29 '22 14:08 jkroonza

Whilst things may be less frequent, we're definitely still seeing problems on this front.

I don't think it has to do with iouring; it looks like memory corruption to me (which of course is harder to pinpoint), from looking at some of the traces.

I agree with this, which is why your other PR with changes w.r.t. locking in inode.c looks interesting. I suspect the corruption is in inode.c, for no good reason other than that there is, in my opinion, a definite correlation between crash frequency and lru-limit.

jkroonza avatar Aug 29 '22 14:08 jkroonza

@xhernandez

As from #3716 - these are the crashes we're getting. Based on the last discussion in that PR:

  1. Since applying that patch to glusterfs 10.2 our overall frequency of crashes has reduced considerably. We've set the two less powerful nodes to use --inode-table-size=1048576.
  2. Prior to applying this patch we were approaching two crashes per day over four nodes (meaning on average every node was crashing somewhere between 48 and 60h intervals).
  3. Since applying #3716 to two of the nodes we've seen a definite decrease in the load average of the systems to which the patch has been applied.
  4. These two nodes are crashing with a lower frequency compared to the two unpatched nodes (which we'll probably also patch today).

Based on the discussion in #3716 (summarised above) we believe there is a problem somewhere in inode.c where inodes are inserted into or removed from the hash table in an unsafe manner, resulting in memory corruption, which leads either to extremely poor performance or, more likely, to crashes of the fuse process. The end result is the same: an unusable system.
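
To illustrate the kind of race we have in mind, here is a made-up, minimal C sketch (bucket_t, node_t and the key strings are invented for the example; this is not glusterfs code): every insert into and removal from a shared hash bucket must happen under the same lock, otherwise a node can be freed while another thread still holds a pointer into the chain.

/* Minimal illustration of why every insert/remove on a shared hash
 * bucket must happen under the same lock.  NOT glusterfs code. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct node {
    struct node *next;
    char key[64];
} node_t;

typedef struct bucket {
    pthread_mutex_t lock;   /* protects 'head' and every node in the chain */
    node_t *head;
} bucket_t;

static void bucket_insert(bucket_t *b, const char *key)
{
    node_t *n = calloc(1, sizeof(*n));
    snprintf(n->key, sizeof(n->key), "%s", key);
    pthread_mutex_lock(&b->lock);
    n->next = b->head;
    b->head = n;
    pthread_mutex_unlock(&b->lock);
}

static void bucket_remove(bucket_t *b, const char *key)
{
    pthread_mutex_lock(&b->lock);
    node_t **pp = &b->head;
    while (*pp && strcmp((*pp)->key, key) != 0)
        pp = &(*pp)->next;
    node_t *victim = *pp;
    if (victim)
        *pp = victim->next;
    pthread_mutex_unlock(&b->lock);
    /* If another thread could still hold a pointer to 'victim' here
     * (looked up without the lock, or with the lock dropped too early),
     * this free() becomes a use-after-free for that thread. */
    free(victim);
}

int main(void)
{
    bucket_t b = { .lock = PTHREAD_MUTEX_INITIALIZER, .head = NULL };
    bucket_insert(&b, "gfid-1");
    bucket_insert(&b, "gfid-2");
    bucket_remove(&b, "gfid-1");
    printf("remaining head: %s\n", b.head ? b.head->key : "(none)");
    return 0;
}

If any path walked the chain without taking b->lock (or let go of it before it was done with the node), the free() in bucket_remove() would turn into a use-after-free for that path, which is the kind of corruption these crashes suggest.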

jkroonza avatar Oct 07 '22 06:10 jkroonza

@jkroonza does your workload contain a lot of renames ?

Is your volume configuration stable, or do you change some setting frequently (at least once between crashes) ?

Do you generate statedumps or do any special operation periodically ?

xhernandez avatar Oct 07 '22 09:10 xhernandez

And one more thing: do you get crashes in bricks and clients, or only on clients ?

If both crash, is the frequency similar ?

xhernandez avatar Oct 07 '22 09:10 xhernandez

It would be interesting to analyze a core dump after the latest change, but I'm not sure how to do that on Gentoo. Is it easy to set up an environment in Docker with the binaries and debug symbols?

xhernandez avatar Oct 07 '22 10:10 xhernandez

Those four nodes effectively do it automatically; I've just wiped the "core archive" and will post the next ones.

We had one crash on a brick that looked similarly confusing, but otherwise it's mostly clients.

My theory is that it's faster to clean entries from the table on the bricks due to not involving the kernel round-trip via fuse, but since I don't understand the process I'm barking up random trees with this statement.

In short, it's easier there to maintain the 64k limit on the table, since no lru-limit parameter is set, and definitely no invalidate-limit.

So as invalidate-limit decreases, the table can grow far beyond lru-limit; performance suffers greatly when invalidate-limit goes too high, and we found that with anything >16 the system basically becomes too slow (with the increased table size this may require tuning again).

With the bricks not having that limit, the number of entries in the table should effectively remain below or close to lru-limit (64k by default); as such, chain contention in the hash table shouldn't be too high (average chain length ~ 1). That's why I suspect the bricks don't crash.

Further to this, if an inode gets selected for purging from the fuse side, as I understand it a message is sent to the kernel to purge the entry, which then sends a command back to the process via FUSE to actually do the work. There is thus a longer time span involved here, and if something in this selection process gets set, but something else happens on the inode in the meantime ...

Basically, what you should take from the above is that I have no clue how this all works and could really do with some guidance on how best to assist you to help/guide me. As shown, I don't mind digging into the code, but I kind of need to know what I'm looking for.

jkroonza avatar Oct 07 '22 13:10 jkroonza

@jkroonza does your workload contain a lot of renames ?

Maildir lives on renames, yes!

Is your volume configuration stable, or do you change some setting frequently (at least once between crashes) ?

No, we haven't changed settings in quite a while.

Do you generate statedumps or do any special operation periodically ?

We try not to, but we can definitely generate a few from time to time. How frequently would be sufficient for your requirements?

As a rule we don't do anything with glusterfs, except when we detect that the mount has failed, at which point we umount -fl and mount again. This is a disruptive operation.

And one more thing: do you get crashes in bricks and clients, or only on clients ?

We've seen one brick crash that I can remember, at least since updating to glusterfs 10.2. Prior to that we had some issues and ended up having to recreate the volume and reconstruct the data from the underlying bricks. That was due to renames being poorly handled, resulting in broken T link files in many, many cases, so now we just use a simple replicate pair, which has other drawbacks (the theory is that distribute improves performance ... but this was a killer for us).

Clients (We only use FUSE) crash consistently.

If both crash, is the frequency similar ?

Not by a long shot (my theory on the matter in previous post).

I'll post the core dumps as and when they are made. Are the backtraces sufficient, or do you need the actual core dumps too? I can provide the split debug symbol files too if that'll help in any way. We'd prefer not to post the full cores publicly (as these may contain confidential data); the stack traces, which only contain pointer values and the like, are fine. I can arrange for you to obtain the cores via an alternative mechanism.

jkroonza avatar Oct 07 '22 13:10 jkroonza

@jkroonza - I assume this is without cherry-picking https://github.com/gluster/glusterfs/pull/3226 which I remember you've looked at? IOW, just the hash changes?

mykaul avatar Oct 07 '22 18:10 mykaul

@jkroonza - I assume this is without cherry-picking #3226 which I remember you've looked at? IOW, just the hash changes?

Correct. Cherry-picking #3226 is non-trivial.

Weekend stacks; it seems not all of our nodes have the debug symbols though. Will look into that when I get a chance.

stack-20221009-142517-core-glfs_epoll001.18539.1665318317.txt stack-20221008-052302-core-glfs_epoll000.25431.1665199382.txt stack-20221010-015732-core-glfs_epoll001.22105.1665359852.txt

What does bug me is that we effectively had a crash a day again over the weekend. I suspect memory corruption:

Backtrace stopped: Cannot access memory at address 0x1e680

This one went down with SIGABRT, although I'd expect SIGSEGV in the case where memory is inaccessible.

One went down in tcmalloc during a __gf_calloc call with ptr = NULL, which I know glibc's calloc has no issue with; however, what if tcmalloc does? In other words, should I consider switching back to the glibc malloc implementation? Perhaps on half the nodes?

The other one went down during __gf_free from mem-pool.c:363 ... seems to be GF_ASSERT(GF_MEM_TRAILER_MAGIC == __gf_mem_trailer_read(trailer)) failure ... which it states is indicative of a memory overrun. Full stack for this particular thread (leading numbers are line-numbers from the trace stack-20221010-015732-core-glfs_epoll001.22105.1665359852.txt above):

555 #0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
561 #1  0x00007f96bcd0a33f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
563 #2  0x00007f96bccbe712 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
565 #3  0x00007f96bcca9469 in __GI_abort () at abort.c:79
569 #4  0x00007f96bcca9395 in __assert_fail_base (fmt=<optimized out>, assertion=<optimized out>, file=<optimized out>, line=<optimized out>, function=<optimized out>) at assert.c:92
572 #5  0x00007f96bccb7972 in __GI___assert_fail (assertion=assertion@entry=0x7f96bd1ad8a0 "0xBAADF00D == __gf_mem_trailer_read(trailer)", file=file@entry=0x7f96bd1ad812 "mem-pool.c", line=line@entry=363, function=function@entry=0x7f96bd1ad8d0 <__PRETTY_FUNCTION__.0> "__gf_free") at assert.c:101
574 #6  0x00007f96bd129da1 in __gf_free (free_ptr=0x55d8abae21a8) at mem-pool.c:363
582 #7  0x00007f96bd0f60c8 in data_destroy (data=0x55d8704710b8) at dict.c:315
584 #8  0x00007f96bd0f6900 in dict_clear_data (this=0x55d85c2a5af8) at dict.c:730
589 #9  dict_destroy (this=0x55d85c2a5af8) at dict.c:757
594 #10 0x00007f96bd0f6a25 in dict_unref (this=<optimized out>) at dict.c:801
597 #11 0x00007f96b789c6be in afr_changelog_do (frame=frame@entry=0x55d89fd46108, this=this@entry=0x55d84d37d628, xattr=xattr@entry=0x55d85abd9958, changelog_resume=changelog_resume@entry=0x7f96b789b370 <afr_changelog_post_op_done>, op=op@entry=AFR_TRANSACTION_POST_OP) at afr-transaction.c:1812
606 #12 0x00007f96b789e0a3 in afr_changelog_post_op_do (frame=0x55d89fd46108, this=0x55d84d37d628) at afr-transaction.c:1443
616 #13 0x00007f96b789f72f in afr_delayed_changelog_wake_up_cbk (data=<optimized out>) at afr-transaction.c:2348
622 #14 0x00007f96b78c4b7b in afr_delayed_changelog_wake_resume (this=this@entry=0x55d84d37d628, inode=0x55d857f1a028, stub=0x55d8a996ce28) at /var/tmp/portage/sys-cluster/glusterfs-10.2-r2/work/glusterfs-10.2/xlators/cluster/afr/src/afr-common.c:4290
627 #15 0x00007f96b78c979c in afr_flush (frame=frame@entry=0x55d84f04fc28, this=this@entry=0x55d84d37d628, fd=fd@entry=0x55d8a4acf208, xdata=xdata@entry=0x0) at /var/tmp/portage/sys-cluster/glusterfs-10.2-r2/work/glusterfs-10.2/xlators/cluster/afr/src/afr-common.c:4319
632 #16 0x00007f96bd18c355 in default_flush (frame=frame@entry=0x55d84f04fc28, this=this@entry=0x55d84d37e828, fd=fd@entry=0x55d8a4acf208, xdata=xdata@entry=0x0) at defaults.c:2531
638 #17 0x00007f96bd18c355 in default_flush (frame=0x55d84f04fc28, this=<optimized out>, fd=fd@entry=0x55d8a4acf208, xdata=xdata@entry=0x0) at defaults.c:2531
644 #18 0x00007f96b779b6dd in wb_flush_helper (frame=0x55d8665ec1e8, this=0x55d84d409228, fd=0x55d8a4acf208, xdata=0x0) at write-behind.c:1996
656 #19 0x00007f96bd126b3d in call_resume_keep_stub (stub=0x55d85430ee28) at call-stub.c:2453
...
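
For anyone not familiar with what that assert checks: allocations guarded this way carry a known magic value written just past the end of the usable buffer, and the free path re-reads it; if anything wrote past the end of the allocation (or the block was recycled underneath us) the magic no longer matches and you get exactly this abort. A rough stand-alone sketch of the idea (the header/trailer layout and names here are invented and do not match the real mem-pool.c layout):

/* Rough sketch of a trailer-canary check; illustrative only. */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TRAILER_MAGIC 0xBAADF00Du

/* a tiny header stores the user size so the free path can find the trailer */
typedef struct { size_t size; } hdr_t;

static void *guarded_alloc(size_t size)
{
    unsigned char *raw = malloc(sizeof(hdr_t) + size + sizeof(uint32_t));
    if (!raw)
        return NULL;
    ((hdr_t *)raw)->size = size;
    uint32_t magic = TRAILER_MAGIC;
    memcpy(raw + sizeof(hdr_t) + size, &magic, sizeof(magic)); /* trailer canary */
    return raw + sizeof(hdr_t);
}

static void guarded_free(void *ptr)
{
    if (!ptr)
        return;
    unsigned char *raw = (unsigned char *)ptr - sizeof(hdr_t);
    size_t size = ((hdr_t *)raw)->size;
    uint32_t magic;
    memcpy(&magic, raw + sizeof(hdr_t) + size, sizeof(magic));
    /* a write past the end of the buffer (or a recycled/corrupted block)
       destroys the canary, and this is the check that aborts */
    assert(magic == TRAILER_MAGIC);
    free(raw);
}

int main(void)
{
    char *buf = guarded_alloc(16);
    memcpy(buf, "hello", 6);
    guarded_free(buf);   /* canary intact: no abort */
    /* writing to buf[16] before the free would have tripped the assert */
    return 0;
}

So the assert firing inside __gf_free() suggests something scribbled past, or freed under, that allocation at some earlier point, well before the crash site.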

This seems to relate to write-behind, which depending on how renames happen could be because the final write happens after the rename? There are also cases of what we refer to as dotlock files which basically is a sequence like:

  1. Create temp file, and write certain process information into it.
  2. rename(2) the file to the lockfile name. If this succeeds, lock taken, if it fails, unlink, wait a random time and retry.

There are caveats in here like maintaining lock freshness etc., but I'm not sure this is relevant. What if write-behind happens after either the rename or the unlink?

bagheera [07:23:30] ~ # gluster volume get mail all | grep write-behind
performance.write-behind-window-size     1MB (DEFAULT)                          
performance.nfs.write-behind-window-size 1MB (DEFAULT)                          
performance.write-behind-trickling-writes on (DEFAULT)                           
performance.nfs.write-behind-trickling-writes on (DEFAULT)                           
performance.write-behind-pass-through    false (DEFAULT)                        
performance.write-behind                 on                                     
performance.nfs.write-behind             on                                     

So call to be made:

  1. Disable write-behind: performance.write-behind-pass-through yes, performance.write-behind no.
  2. Recompile to glibc malloc (ie, no tcmalloc) - half the nodes.
  3. Both of the above. Since 1 is a global setting ... it'll affect all nodes, so if we do both and all crashes stop, we know it's write-behind; if only the tcmalloc-disabled nodes stop, we know it's tcmalloc; if neither stops, we know it's neither.

Does this make sense? Any recommendations?

Regarding the other possibility, dict.c: there is a comment in dict_clear_data about having to be called with this->lock held; however, dict_unref does NOT do that. Given that (in theory) nothing else should hold pointers to this, I do not believe this should be a problem?

jkroonza avatar Oct 10 '22 05:10 jkroonza

@jkroonza - I assume this is without cherry-picking #3226 which I remember you've looked at? IOW, just the hash changes?

Correct. Cherry-picking #3226 is non-trivial.

Weekend stacks; it seems not all of our nodes have the debug symbols though. Will look into that when I get a chance.

stack-20221009-142517-core-glfs_epoll001.18539.1665318317.txt stack-20221008-052302-core-glfs_epoll000.25431.1665199382.txt stack-20221010-015732-core-glfs_epoll001.22105.1665359852.txt

What does bug me is that we effectively had a crash a day again over the weekend. I suspect memory corruption:

Backtrace stopped: Cannot access memory at address 0x1e680

Most probably this happens because symbols are not present and gdb is not decoding the stack correctly.

This one went down with SIGABRT, although I'd expect SIGSEGV in the case where memory is inaccessible.

One went down in tcmalloc during a __gf_calloc call with ptr = NULL, which I know glibc's calloc has no issue with; however, what if tcmalloc does? In other words, should I consider switching back to the glibc malloc implementation? Perhaps on half the nodes?

I don't see any issue here with ptr. The only reason why ptr is NULL is because the calloc() request was still being processed when the crash happened. The most likely reason seems to be memory corruption.

The other one went down during __gf_free from mem-pool.c:363 ... seems to be GF_ASSERT(GF_MEM_TRAILER_MAGIC == __gf_mem_trailer_read(trailer)) failure ... which it states is indicative of a memory overrun. Full stack for this particular thread (leading numbers are line-numbers from the trace stack-20221010-015732-core-glfs_epoll001.22105.1665359852.txt above):

555 #0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
561 #1  0x00007f96bcd0a33f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
563 #2  0x00007f96bccbe712 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
565 #3  0x00007f96bcca9469 in __GI_abort () at abort.c:79
569 #4  0x00007f96bcca9395 in __assert_fail_base (fmt=<optimized out>, assertion=<optimized out>, file=<optimized out>, line=<optimized out>, function=<optimized out>) at assert.c:92
572 #5  0x00007f96bccb7972 in __GI___assert_fail (assertion=assertion@entry=0x7f96bd1ad8a0 "0xBAADF00D == __gf_mem_trailer_read(trailer)", file=file@entry=0x7f96bd1ad812 "mem-pool.c", line=line@entry=363, function=function@entry=0x7f96bd1ad8d0 <__PRETTY_FUNCTION__.0> "__gf_free") at assert.c:101
574 #6  0x00007f96bd129da1 in __gf_free (free_ptr=0x55d8abae21a8) at mem-pool.c:363
582 #7  0x00007f96bd0f60c8 in data_destroy (data=0x55d8704710b8) at dict.c:315
584 #8  0x00007f96bd0f6900 in dict_clear_data (this=0x55d85c2a5af8) at dict.c:730
589 #9  dict_destroy (this=0x55d85c2a5af8) at dict.c:757
594 #10 0x00007f96bd0f6a25 in dict_unref (this=<optimized out>) at dict.c:801
597 #11 0x00007f96b789c6be in afr_changelog_do (frame=frame@entry=0x55d89fd46108, this=this@entry=0x55d84d37d628, xattr=xattr@entry=0x55d85abd9958, changelog_resume=changelog_resume@entry=0x7f96b789b370 <afr_changelog_post_op_done>, op=op@entry=AFR_TRANSACTION_POST_OP) at afr-transaction.c:1812
606 #12 0x00007f96b789e0a3 in afr_changelog_post_op_do (frame=0x55d89fd46108, this=0x55d84d37d628) at afr-transaction.c:1443
616 #13 0x00007f96b789f72f in afr_delayed_changelog_wake_up_cbk (data=<optimized out>) at afr-transaction.c:2348
622 #14 0x00007f96b78c4b7b in afr_delayed_changelog_wake_resume (this=this@entry=0x55d84d37d628, inode=0x55d857f1a028, stub=0x55d8a996ce28) at /var/tmp/portage/sys-cluster/glusterfs-10.2-r2/work/glusterfs-10.2/xlators/cluster/afr/src/afr-common.c:4290
627 #15 0x00007f96b78c979c in afr_flush (frame=frame@entry=0x55d84f04fc28, this=this@entry=0x55d84d37d628, fd=fd@entry=0x55d8a4acf208, xdata=xdata@entry=0x0) at /var/tmp/portage/sys-cluster/glusterfs-10.2-r2/work/glusterfs-10.2/xlators/cluster/afr/src/afr-common.c:4319
632 #16 0x00007f96bd18c355 in default_flush (frame=frame@entry=0x55d84f04fc28, this=this@entry=0x55d84d37e828, fd=fd@entry=0x55d8a4acf208, xdata=xdata@entry=0x0) at defaults.c:2531
638 #17 0x00007f96bd18c355 in default_flush (frame=0x55d84f04fc28, this=<optimized out>, fd=fd@entry=0x55d8a4acf208, xdata=xdata@entry=0x0) at defaults.c:2531
644 #18 0x00007f96b779b6dd in wb_flush_helper (frame=0x55d8665ec1e8, this=0x55d84d409228, fd=0x55d8a4acf208, xdata=0x0) at write-behind.c:1996
656 #19 0x00007f96bd126b3d in call_resume_keep_stub (stub=0x55d85430ee28) at call-stub.c:2453
...

This effectively does seem to be memory corruption, but I think it's more likely the effect of a use-after-free problem than a memory overrun. A use-after-free could also explain why there are so many crashes in different components and locations.

This seems to relate to write-behind, which depending on how renames happen could be because the final write happens after the rename?

I don't think so. Pending writes are always processed before closing the file, so if the "close" happens before the "rename", there is no problem. But even if the rename is done before closing the file, writes are sent to an open fd, which translates into an inode, which doesn't change even after the rename.

There are also cases of what we refer to as dotlock files which basically is a sequence like:

  1. Create temp file, and write certain process information into it.
  2. rename(2) the file to the lockfile name. If this succeeds, lock taken, if it fails, unlink, wait a random time and retry.

I don't understand this process. rename(2) does not fail if the target file already exists; it simply overwrites it. Only renameat2() can cause failures in case the target already exists. Is that what you mean, or am I missing something else?

There are caveats in here like maintaining lock freshness etc., but I'm not sure this is relevant. What if write-behind happens after either the rename or the unlink?

It shouldn't matter unless there's a bug. As I said, write-behind uses an open fd, so it should be immune to renames and unlinks.

bagheera [07:23:30] ~ # gluster volume get mail all | grep write-behind
performance.write-behind-window-size     1MB (DEFAULT)                          
performance.nfs.write-behind-window-size 1MB (DEFAULT)                          
performance.write-behind-trickling-writes on (DEFAULT)                           
performance.nfs.write-behind-trickling-writes on (DEFAULT)                           
performance.write-behind-pass-through    false (DEFAULT)                        
performance.write-behind                 on                                     
performance.nfs.write-behind             on                                     

So call to be made:

  1. Disable write-behind: performance.write-behind-pass-through yes, performance.write-behind no.

You can try. From the functional point of view it shouldn't matter. However this can have a significant performance hit for massive write workloads.

  2. Recompile to glibc malloc (ie, no tcmalloc) - half the nodes.

This shouldn't matter and I don't think it's related to the crashes. Also, tcmalloc seems to be more efficient, so performance is better with tcmalloc than glibc.

  3. Both of the above. Since 1 is a global setting ... it'll affect all nodes, so if we do both and all crashes stop, we know it's write-behind; if only the tcmalloc-disabled nodes stop, we know it's tcmalloc; if neither stops, we know it's neither.

Note that even if write-behind operates incorrectly in the mentioned cases, this could cause some errors or weird effects regarding where data is written, but it shouldn't cause crashes. I think there's a memory corruption problem, which could be anywhere.

Does this make sense? Any recommendations?

I think that the real problem here is memory corruption probably caused by use-after-free. Probably we have some object that gets released (and potentially reused) while some thread is still using it.
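
Purely as an illustration of that pattern (made-up code, not glusterfs internals): with reference-counted objects the contract is that every user holds its own reference for as long as it touches the object. A path that keeps using a borrowed pointer after the owner has dropped the last reference is exactly the kind of bug that produces these delayed, hard-to-localize crashes.

/* Illustrative only: a minimal refcounted object showing the contract
   whose violation produces a use-after-free. */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    atomic_int refs;
    int payload;
} obj_t;

static obj_t *obj_new(int payload)
{
    obj_t *o = malloc(sizeof(*o));
    atomic_init(&o->refs, 1);
    o->payload = payload;
    return o;
}

static obj_t *obj_ref(obj_t *o)
{
    atomic_fetch_add(&o->refs, 1);
    return o;
}

static void obj_unref(obj_t *o)
{
    if (atomic_fetch_sub(&o->refs, 1) == 1) {
        /* last reference dropped: the allocator may hand this memory out
           again immediately, so any still-borrowed pointer is now dangling */
        free(o);
    }
}

int main(void)
{
    obj_t *o = obj_new(42);
    obj_t *borrowed = obj_ref(o);   /* correct: take our own reference */
    obj_unref(o);                   /* the original owner lets go */
    printf("still safe to read: %d\n", borrowed->payload);
    obj_unref(borrowed);            /* the object is actually freed here */
    return 0;
}

If the obj_ref() call were missing, the printf() would read freed (and possibly already reused) memory, and the damage would typically only show up later, in an unrelated allocation or free.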

Regarding the other possibility, dict.c: there is a comment in dict_clear_data about having to be called with this->lock held; however, dict_unref does NOT do that. Given that (in theory) nothing else should hold pointers to this, I do not believe this should be a problem?

It shouldn't, but certainly there seems to be a problem here.

The most interesting core dump to analyze seems to be this one: stack-20221010-015732-core-glfs_epoll001.22105.1665359852.txt. I'm not sure if you can provide it to me. However, I don't have any idea how to analyze it: your system is Gentoo and I don't have any experience with replicating the environment so that gdb is able to correctly analyze it. Can you help me here?

xhernandez avatar Oct 10 '22 09:10 xhernandez

Most probably this happens because symbols are not present and gdb is not decoding the stack correctly.

Agreed. Funnily enough, all systems do have the debug symbols installed, so I'm not sure why gdb isn't picking them up on that one. Will look into that now.

Regarding reproduction:

garmr [14:38:17] /etc/portage (master) # grep -r glusterfs .
...
./package.use/99glusterfs -* debug fuse python_single_target_python3_10 tcmalloc
./package.env/custom:sys-cluster/glusterfs debugsyms
garmr [14:39:47] /etc/portage (master) # cat env/debugsyms 
CFLAGS="${CFLAGS} -ggdb"
CXXFLAGS="${CXXFLAGS} -ggdb"
FEATURES="${FEATURES} splitdebug compressdebug -nostrip"
USE="debug"

Of course, installing Gentoo is quite a mission in and of itself if you're not familiar with the process.

This one went down with SIGABRT, although I'd expect SIGSEGV in the case where memory is inaccessible. One went down in tcmalloc during a __gf_calloc call with ptr = NULL, which I know glibc's calloc has no issue with; however, what if tcmalloc does? In other words, should I consider switching back to the glibc malloc implementation? Perhaps on half the nodes?

I don't see any issue here with ptr. The only reason why ptr is NULL is because the calloc() request was still being processed when the crash happened. The most likely reason seems to be memory corruption.

As stated, calloc on NULL should be equivalent to malloc.

This is the thread that crashed, and yes, I agree this can be caused by a use-after-free which then corrupts tcmalloc's internal data structures.

The other one went down during __gf_free from mem-pool.c:363 ... seems to be GF_ASSERT(GF_MEM_TRAILER_MAGIC == __gf_mem_trailer_read(trailer)) failure ... which it states is indicative of a memory overrun. Full stack for this particular thread (leading numbers are line-numbers from the trace stack-20221010-015732-core-glfs_epoll001.22105.1665359852.txt above):

555 #0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
561 #1  0x00007f96bcd0a33f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at pthread_kill.c:78
563 #2  0x00007f96bccbe712 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
565 #3  0x00007f96bcca9469 in __GI_abort () at abort.c:79
569 #4  0x00007f96bcca9395 in __assert_fail_base (fmt=<optimized out>, assertion=<optimized out>, file=<optimized out>, line=<optimized out>, function=<optimized out>) at assert.c:92
572 #5  0x00007f96bccb7972 in __GI___assert_fail (assertion=assertion@entry=0x7f96bd1ad8a0 "0xBAADF00D == __gf_mem_trailer_read(trailer)", file=file@entry=0x7f96bd1ad812 "mem-pool.c", line=line@entry=363, function=function@entry=0x7f96bd1ad8d0 <__PRETTY_FUNCTION__.0> "__gf_free") at assert.c:101
574 #6  0x00007f96bd129da1 in __gf_free (free_ptr=0x55d8abae21a8) at mem-pool.c:363
582 #7  0x00007f96bd0f60c8 in data_destroy (data=0x55d8704710b8) at dict.c:315
584 #8  0x00007f96bd0f6900 in dict_clear_data (this=0x55d85c2a5af8) at dict.c:730
589 #9  dict_destroy (this=0x55d85c2a5af8) at dict.c:757
594 #10 0x00007f96bd0f6a25 in dict_unref (this=<optimized out>) at dict.c:801
597 #11 0x00007f96b789c6be in afr_changelog_do (frame=frame@entry=0x55d89fd46108, this=this@entry=0x55d84d37d628, xattr=xattr@entry=0x55d85abd9958, changelog_resume=changelog_resume@entry=0x7f96b789b370 <afr_changelog_post_op_done>, op=op@entry=AFR_TRANSACTION_POST_OP) at afr-transaction.c:1812
606 #12 0x00007f96b789e0a3 in afr_changelog_post_op_do (frame=0x55d89fd46108, this=0x55d84d37d628) at afr-transaction.c:1443
616 #13 0x00007f96b789f72f in afr_delayed_changelog_wake_up_cbk (data=<optimized out>) at afr-transaction.c:2348
622 #14 0x00007f96b78c4b7b in afr_delayed_changelog_wake_resume (this=this@entry=0x55d84d37d628, inode=0x55d857f1a028, stub=0x55d8a996ce28) at /var/tmp/portage/sys-cluster/glusterfs-10.2-r2/work/glusterfs-10.2/xlators/cluster/afr/src/afr-common.c:4290
627 #15 0x00007f96b78c979c in afr_flush (frame=frame@entry=0x55d84f04fc28, this=this@entry=0x55d84d37d628, fd=fd@entry=0x55d8a4acf208, xdata=xdata@entry=0x0) at /var/tmp/portage/sys-cluster/glusterfs-10.2-r2/work/glusterfs-10.2/xlators/cluster/afr/src/afr-common.c:4319
632 #16 0x00007f96bd18c355 in default_flush (frame=frame@entry=0x55d84f04fc28, this=this@entry=0x55d84d37e828, fd=fd@entry=0x55d8a4acf208, xdata=xdata@entry=0x0) at defaults.c:2531
638 #17 0x00007f96bd18c355 in default_flush (frame=0x55d84f04fc28, this=<optimized out>, fd=fd@entry=0x55d8a4acf208, xdata=xdata@entry=0x0) at defaults.c:2531
644 #18 0x00007f96b779b6dd in wb_flush_helper (frame=0x55d8665ec1e8, this=0x55d84d409228, fd=0x55d8a4acf208, xdata=0x0) at write-behind.c:1996
656 #19 0x00007f96bd126b3d in call_resume_keep_stub (stub=0x55d85430ee28) at call-stub.c:2453
...

This effectively does seem to be memory corruption, but I think it's more likely the effect of a use-after-free problem than a memory overrun. A use-after-free could also explain why there are so many crashes in different components and locations.

OK. This makes perfect sense to me. The question is: how can we track it?

This seems to relate to write-behind, which depending on how renames happen could be because the final write happens after the rename?

I don't think so. Pending writes are always processed before closing the file, so if the "close" happens before the "rename", there is no problem. But even if the rename is done before closing the file, writes are sent to an open fd, which translates into an inode, which doesn't change even after the rename.

There are also cases of what we refer to as dotlock files which basically is a sequence like:

  1. Create temp file, and write certain process information into it.
  2. rename(2) the file to the lockfile name. If this succeeds, lock taken, if it fails, unlink, wait a random time and retry.

I don't understand this process. rename(2) does not fail if the target file already exists; it simply overwrites it. Only renameat2() can cause failures in case the target already exists. Is that what you mean, or am I missing something else?

Actually it seems to use link(2); the algorithm is described in lockfile_create. I believe most libraries don't vary the wait time, and I highly doubt they even wait 5 seconds initially. But there you have it. So the file is created, then linked to the lockfile name (return value ignored), then stat is done on both; if they're the same file, success, else retry. On failure certain unspecified validity checks are done on the lock file, e.g. age could be a check, under the assumption that no one may hold the lock for longer than two minutes without touch()ing it.
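
For clarity, this is roughly what that sequence looks like in code. It's my own simplification of a lockfile_create-style dotlock; the file names, retry count and sleep times are made up, and error handling is trimmed:

/* Rough sketch of the link(2)-based dotlock described above. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

static int dotlock_acquire(const char *lockfile, int retries)
{
    char tmp[256];
    snprintf(tmp, sizeof(tmp), "%s.%ld.%d", lockfile, (long)getpid(), rand());

    while (retries-- > 0) {
        /* 1. create a uniquely named temp file and write our identity */
        int fd = open(tmp, O_WRONLY | O_CREAT | O_EXCL, 0644);
        if (fd < 0)
            return -1;
        dprintf(fd, "%ld\n", (long)getpid());
        close(fd);

        /* 2. link(2) it to the lock name; the return value is ignored,
           success is decided by comparing the two inodes afterwards */
        (void)link(tmp, lockfile);

        struct stat st_tmp, st_lock;
        int ok = stat(tmp, &st_tmp) == 0 &&
                 stat(lockfile, &st_lock) == 0 &&
                 st_tmp.st_ino == st_lock.st_ino;

        unlink(tmp);                 /* the temp name is no longer needed */
        if (ok)
            return 0;                /* lock taken */

        /* someone else holds the lock: a real implementation would check
           its age/freshness here before sleeping and retrying */
        sleep(1 + rand() % 5);
    }
    return -1;
}

int main(void)
{
    if (dotlock_acquire("mailbox.lock", 3) == 0) {
        puts("lock acquired");
        unlink("mailbox.lock");      /* release */
    }
    return 0;
}

A real implementation also keeps the lock fresh while held and applies the staleness checks mentioned above before deciding whether to retry or break a stale lock.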

There are caveats in here like maintaining lock freshness etc., but I'm not sure this is relevant. What if write-behind happens after either the rename or the unlink?

It shouldn't matter unless there's a bug. As I said, write-behind uses an open fd, so it should be immune to renames and unlinks.

An open fd where? Does an inode entry on the brick guarantee that there will be an open fd on the brick?

So call to be made:

  1. Disable write-behind: performance.write-behind-pass-through yes, performance.write-behind no.

You can try. From the functional point of view it shouldn't matter. However this can have a significant performance hit for massive write workloads.

  2. Recompile to glibc malloc (ie, no tcmalloc) - half the nodes.

This shouldn't matter and I don't think it's related to the crashes. Also, tcmalloc seems to be more efficient, so performance is better with tcmalloc than glibc.

Our own testing says it's a lot more efficient, without quantifying the results.

  3. Both of the above. Since 1 is a global setting ... it'll affect all nodes, so if we do both and all crashes stop, we know it's write-behind; if only the tcmalloc-disabled nodes stop, we know it's tcmalloc; if neither stops, we know it's neither.

Note that even if write-behind operates incorrectly in the mentioned cases, this could cause some errors or weird effects regarding where data is written, but it shouldn't cause crashes. I think there's a memory corruption problem, which could be anywhere.

Does this make sense? Any recommendations?

I think that the real problem here is memory corruption probably caused by use-after-free. Probably we have some object that gets released (and potentially reused) while some thread is still using it.

I agree, this makes sense. So how do we track it? It seems we're just about the only people running into this (which would indicate it's in an op that's infrequently used, and it's probably racy too). We run Maildir on top of glusterfs, which is heavily reliant on rename operations compared to most other workloads.

For example, take the email delivery process for delivery into /var/spool/mail/jk/jkroon's INBOX, which sits at that same path and consists of three sub-folders, namely cur/, new/ and tmp/:

  1. Create an appropriately (uniquely) named file in the tmp/ folder.
  2. Populate the file.
  3. rename() it into new/.

The mail client process can do a bunch of things depending on the mail client, but amongst others, on its first scan of new/ it will rename() the file into cur/.

This sounded excessively complex to me the first few times, but there are good reasons for it. Specifically, in new/ no "flags" are set on the filename yet; upon rename() into cur/ certain things happen, depending on the retrieving protocol and mail client, but typically the seen flag will be set and a UUID will be allocated in a separate client-specific database (in the case of POP3 a simpler UID mechanism is used to accommodate the protocol).
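
In code form, the delivery side of that flow is roughly the following (a simplified sketch; the file names are invented, and real deliveries derive the unique name from time/pid/hostname and do proper error handling):

/* Simplified sketch of the Maildir delivery flow described above. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *maildir = "/var/spool/mail/jk/jkroon";
    const char msg[] = "From: ...\r\n\r\nbody\r\n";
    char tmp_path[512], new_path[512], cur_path[512];

    /* 1./2. create a uniquely named file in tmp/ and write the message */
    snprintf(tmp_path, sizeof(tmp_path), "%s/tmp/1665359852.12345.host", maildir);
    snprintf(new_path, sizeof(new_path), "%s/new/1665359852.12345.host", maildir);
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return 1;
    if (write(fd, msg, sizeof(msg) - 1) < 0 || fsync(fd) < 0) {
        close(fd);
        return 1;
    }
    close(fd);

    /* 3. publish it atomically: rename tmp/ -> new/ */
    if (rename(tmp_path, new_path) != 0)
        return 1;

    /* later, on its first scan of new/, the mail client renames the message
       into cur/, typically appending flag info (e.g. "seen") to the name */
    snprintf(cur_path, sizeof(cur_path), "%s/cur/1665359852.12345.host:2,S", maildir);
    return rename(new_path, cur_path) != 0;
}

So every message goes through at least two rename() calls during its lifetime, which is why this workload hammers rename so much harder than the typical VM-image use case.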

These database files are generally updated under dotlock files (as created by lockfile_create, although I'm not aware of a great many clients using that specific library function; mostly it's self-implemented, as the algorithm is simple enough and developers want control over things like the timeouts).

Once the dotlock file is obtained, a replacement database is created in tmp/ and then rename()d into place prior to the dotlock file being released.

The point being that rename() calls are par for the course; they're used heavily. And given that, to my understanding, the majority of glusterfs use is to provide "block devices" for VMs, this is definitely an "unusual" use case.

Regarding the other possibility, dict.c: there is a comment in dict_clear_data about having to be called with this->lock held; however, dict_unref does NOT do that. Given that (in theory) nothing else should hold pointers to this, I do not believe this should be a problem?

It shouldn't, but certainly there seems to be a problem here.

The most interesting core dump to analyze seems to be this one: stack-20221010-015732-core-glfs_epoll001.22105.1665359852.txt. I'm not sure if you can provide it to me. However, I don't have any idea how to analyze it: your system is Gentoo and I don't have any experience with replicating the environment so that gdb is able to correctly analyze it. Can you help me here?

I can provide supervised access, but we'll need to arrange for that; my email is [email protected], so for the purpose of getting access to the core dump please do contact me there. I can provide the raw core along with the symbol files as an alternative, but that may end up being more effort than just giving you supervised access to the system. I will need an ssh key from you.

Btw, I do appreciate the interest in this; we've been struggling with it since early in the year and have not made any progress. From what I recall, though, there are certain configurations whereby we can switch off the things that do concurrency (like fuse readers and epoll threads, I can't even recall their names) and we'd have no crashes, but performance would suck so badly that customers would start getting time-outs on operations. Seeing that the fuse processes now generally run at <50% of a core (it used to be somewhat more), perhaps we can try that again on one or two of the nodes.

jkroonza avatar Oct 10 '22 14:10 jkroonza