Inode table size fixes

Open jkroonza opened this issue 3 years ago • 25 comments

This is incomplete as yet, and still requires the ability to actually set the hash table size via some option or another (ideally on a per-fuse-mount basis). Still, posting this here in the meantime for feedback.

This has two purposes:

  1. We're hoping to reduce the time spent traversing the table chains (which in our use case average 16 entries, with ~1m entries hashed into 64k buckets).
  2. Mitigating the frequent crashes we're seeing when lru_limit > 64k. We're hoping that by increasing the table size we can at least limit the number of crashes. We're fully aware this won't solve the underlying problem, but if we can get it down from the current two per day across four nodes to one a week, that would buy us time.

jkroonza avatar Aug 16 '22 11:08 jkroonza

Can one of the admins verify this patch?

gluster-ant avatar Aug 16 '22 11:08 gluster-ant


CLANG-FORMAT FAILURE: Before merging the patch, this diff needs to be considered for passing clang-format

index d4b9661e4..62b5dcf6d 100644
--- a/libglusterfs/src/inode.c
+++ b/libglusterfs/src/inode.c
@@ -176,7 +176,7 @@ hash_dentry(inode_t *parent, const char *name, int mod)
 static int
 hash_gfid(uuid_t uuid, int mod)
 {
-    return ((int*)uuid)[1] & (1U >> 1) % mod;
+    return ((int *)uuid)[1] & (1U >> 1) % mod;
 }
 
 static void

gluster-ant avatar Aug 16 '22 11:08 gluster-ant

Unrelated to the above, you may want to look at:

  1. https://github.com/gluster/glusterfs/pull/3226
  2. See if you can skip the hash - for example, in inode_find() you can check for root before the hash calculation, or before taking the lock. This may or may not slightly help for your use case (a sketch follows this list).
  3. There's some waste in keeping 64K list objects where you actually need only a single pointer per list head. Not sure it'd help much, but it will save some memory (8 bytes * 64K) - I think it's one of those lists we never actually traverse, so we don't need both next AND prev.
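
Something along these lines, perhaps - just a sketch; the types come from libglusterfs's inode.h, and the internal helper signatures here (__inode_find() taking a precomputed hash, __inode_ref()'s arguments) are assumptions rather than the exact upstream code:

/* Sketch: special-case the root gfid in inode_find() so the common
 * "find root" case skips both the hash computation and the table lock. */
inode_t *
inode_find(inode_table_t *table, uuid_t gfid)
{
    inode_t *inode = NULL;

    if (__is_root_gfid(gfid)) /* root is pinned; no lookup needed */
        return inode_ref(table->root);

    pthread_mutex_lock(&table->lock);
    {
        int hash = hash_gfid(gfid, table->inode_hashsize);
        inode = __inode_find(table, gfid, hash); /* assumed signature */
        if (inode)
            inode = __inode_ref(inode, false);   /* assumed signature */
    }
    pthread_mutex_unlock(&table->lock);

    return inode;
}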

mykaul avatar Aug 16 '22 12:08 mykaul

@jkroonza Have you analyzed what improvement you are seeing after changing the hash function?

mohit84 avatar Aug 16 '22 12:08 mohit84

@jkroonza Have you analyzed what improvement you are seeing after changing the hash function?

Hi @mohit84

No, I have not. I highly doubt it will make a significant difference at this point; we're probably saving one, maybe two CPU cycles per hash, which in my opinion is irrelevant. It is absolutely dwarfed by factors like the division to get the modulo.

Keep in mind that the primary purpose behind this initial commit is that hash_gfid really should be able to return values > 0xffff so that larger hash table sizes make sense. With the old mechanism, even if the mod argument was > 65536, no value >= 65536 would ever be returned, because only the last two bytes of the UUID value were taken into consideration.
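
To illustrate the limitation (the first function is just the shape of the old behaviour, not the verbatim old code; the second is the masked form from the diff above, which requires a power-of-two mod):

typedef unsigned char uuid_t[16]; /* as in libuuid */

/* Old style: built from the last two UUID bytes only, so the result can
 * never exceed 0xffff regardless of mod. */
static int
hash_gfid_old_style(uuid_t uuid, int mod)
{
    return (uuid[15] + (uuid[14] << 8)) % mod; /* capped at 65535 */
}

/* New style: mask 32 bits of the gfid, so any power-of-two table size
 * can be addressed. */
static int
hash_gfid_new_style(uuid_t uuid, int mod)
{
    return ((int *)uuid)[0] & (mod - 1); /* mod MUST be a power of two */
}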

Thereafter it may become possible to measure the efficiency of different hash functions, which, to say the least, will be non-trivial:

  1. Do you just measure the hashing speed?
  2. How do you measure the distribution of hashes? (No matter what we do here, the bucket-list traversal will remain the bigger performance factor, so better distribution == better overall performance; a rough harness for this is sketched below.)
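
Something like this, perhaps (standalone and illustrative only; the masking matches the patch but is rewritten with memcpy to sidestep strict-aliasing questions, and random bytes stand in for real gfids):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hash the first 32 bits of a gfid into a power-of-two bucket count. */
static unsigned
hash_gfid(const unsigned char *gfid, unsigned mod)
{
    unsigned h;
    memcpy(&h, gfid, sizeof(h));
    return h & (mod - 1); /* mod must be a power of two */
}

int
main(void)
{
    enum { BUCKETS = 1 << 19, ENTRIES = 2 * 1024 * 1024 };
    static unsigned counts[BUCKETS];
    unsigned char gfid[16];
    unsigned max = 0, used = 0;

    srand(42);
    for (int i = 0; i < ENTRIES; i++) {
        for (int j = 0; j < 16; j++)
            gfid[j] = (unsigned char)(rand() & 0xff);
        counts[hash_gfid(gfid, BUCKETS)]++;
    }
    for (unsigned b = 0; b < BUCKETS; b++) {
        if (counts[b]) used++;
        if (counts[b] > max) max = counts[b];
    }
    printf("buckets used: %u of %u, longest chain: %u, mean load: %.2f\n",
           used, (unsigned)BUCKETS, max, (double)ENTRIES / BUCKETS);
    return 0;
}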

And lastly, performance, whilst a concern, is not my primary one. I need to reduce the occurrence of #3659, and we have since proven this isn't isolated to fuse; however, the frequency of occurrence directly correlates with the lru_limit value, so at lru_limit=512k we see (on four nodes) approximately one crash per day on the cluster, and at 1m we're generally seeing two per day.

The only explanation that makes sense to me is that by increasing contention on the bucket lists we're increasing the probability of something going wrong (probably due to some lock issue somewhere which I can't find, despite trying ever since I logged said issue). So this patch series has two purposes:

  1. Reduce the frequency and thus client impact of #3659; and
  2. Potentially improve overall lookup efficiency.

@mykaul

Unrelated to the above, you may want to look at:

1. [Multiple improvements to inode_link #3226](https://github.com/gluster/glusterfs/pull/3226)

This looks very interesting; I will look it over more thoroughly when I can. Possibly you have indirectly addressed our crashing issue by accident. I think I might recognise the problem when I see it, but I'm struggling to wrap my head around the overall inode.c design here. That's probably to a large degree because I'm learning this by way of stack traces rather than just reading the code like a book as I should (and as I've done to implement these relatively insignificant snippets).

2. See if you can skip the hash - for example, in inode_find() you can check for root before the hash calculation, or before taking the lock. This may or may not slightly help for your use case.

I suspect very few of our lookups hit root. Our primary headache here is courier-imap (and to a lesser degree apache+php-fpm, which would certainly benefit from a larger lru_limit, but which we're leaving at the default 64k for the moment to avoid FUSE crashes on web nodes, where the impact would certainly be more high-profile). courier-imap will generally just open(2) a dir on startup and then openat(2) relative to that fd. And as stated, the primary purpose isn't performance (although cutting back even half a core of CPU per node certainly won't go amiss, given that we're generally burning 3-4 cores on the two primary brick nodes and 1 to 1.5 on the two fuse-mount-only nodes).

3. There's some waste in keeping 64K list objects where you actually need only a single pointer per list head. Not sure it'd help much, but it will save some memory (8 bytes * 64K) - I think it's one of those lists we never actually traverse, so we don't need both next AND prev.

Try 1m objects overall: with lru_limit < 1m the table tends to be consistently busy invalidating items, and with invalidate-limit > 64 this massively impacts performance (to the point where we're already discussing dropping it to 32). So that's around 8MB of memory out of 1.7-2.2GB, so frankly, insignificant.

jkroonza avatar Aug 16 '22 20:08 jkroonza

/run regression

mykaul avatar Sep 14 '22 12:09 mykaul

@mykaul guess this can be merged as is, but it's really still lacking a lot of other footwork, which I've yet to start on, to actually set a hash table size >65536 (I'm aiming for 512k for starters).

jkroonza avatar Sep 14 '22 12:09 jkroonza

@mykaul guess this can be merged as is, but it's really still lacking a lot of other footwork, which I've yet to start on, to actually set a hash table size >65536 (I'm aiming for 512k for starters).

It really needs to show value, or at the very least that it doesn't regress performance (Mohit asked for this earlier). BTW, thinking about it some more - perhaps we should change the way we do the hashing to linear address hashing (https://en.wikipedia.org/wiki/Linear_probing)?

mykaul avatar Sep 14 '22 13:09 mykaul

@mykaul guess this can be merged as is, but it's really still lacking a lot of other footwork, which I've yet to start on, to actually set a hash table size >65536 (I'm aiming for 512k for starters).

It really needs to show value, or at the very least that it doesn't regress performance (Mohit asked for this earlier). BTW, thinking about it some more - perhaps we should change the way we do the hashing to linear address hashing (https://en.wikipedia.org/wiki/Linear_probing)?

I would not. Consider that we currently run with >1m entries in that hash table, so it would be completely overflowed by that. By implication, the average chain length for us is already 16. We're taking on another LARGE customer now, so I'm expecting this will need to go up to 2m entries in the next month or two (average chain length 32 per bucket). By implication, linear probing for us would thus require the base hash table to have a size of at least 4m, but ideally closer to 10m (or about 80MB of RAM just for that base lookup table). The current scheme still uses a fair amount of RAM: depending on exactly how the pointer manipulation is done, 2m entries with a base table size of 512k would need between 12MB and possibly 24MB, and a really wasteful implementation perhaps as much as 48MB. That is significant in percentage terms, but perhaps not in terms of total space, considering the servers in question have upwards of 16GB each, most of which is potentially available to glusterfs.

If you want to do that, you need to ensure that the hash table size is at least 2x the number of allowable entries in order to retain even a remote level of performance, probably ideally more like 8x. Also, where an entry can currently be removed simply by unlinking it from its chain, with linear probing you'd have to iterate until you find a null slot, and any number of entries up to that point may need to be moved backwards (see the sketch below).
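
To be concrete about the deletion problem, a generic backward-shift delete for linear probing looks roughly like this (illustration only, not glusterfs code; entry_home() is an assumed helper returning hash & mask for an entry):

#include <stddef.h>

typedef struct entry entry_t;

/* Assumed helper: the entry's home bucket, i.e. hash(gfid) & mask. */
size_t entry_home(const entry_t *e, size_t mask);

/* After emptying slot i, any later entry in the same probe run whose home
 * bucket lies at or before the hole must slide back, otherwise subsequent
 * lookups would stop at the hole and miss it. */
static void
probe_delete(entry_t **table, size_t mask, size_t i)
{
    table[i] = NULL;
    for (size_t j = (i + 1) & mask; table[j] != NULL; j = (j + 1) & mask) {
        size_t home = entry_home(table[j], mask);
        /* Wrap-aware test: does the run from home to j pass through i? */
        if (((j - home) & mask) >= ((j - i) & mask)) {
            table[i] = table[j]; /* slide the entry back into the hole */
            table[j] = NULL;
            i = j;               /* the hole moves forward */
        }
    }
}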

Alternatively, the table size would need to be scaled dynamically (which may be a requirement for linear probing regardless). Assuming a fairly low usage level this could provide improved overall performance, but the additional RAM utilization is definitely a factor to consider, and you need to maintain a relatively low population in order to avoid potentially long search times (potentially much longer than can be expected currently).

jkroonza avatar Sep 14 '22 14:09 jkroonza

Small comment on the memory size of the table - it would be smaller if we weren't using lists (16 bytes per entry for the prev and next pointers) and used just a single pointer per entry. It should also be faster to go over (theoretically; didn't test yet, of course). BTW, one of the ideas of linear addressing is that the table grows dynamically, in the sense that you start small and either double it or add 30% or so once it reaches some threshold.

Perhaps there are other per-entry memory-consumption optimizations (could gf_atomic_t nlookup and gf_atomic_t kids be 32-bit atomics rather than 64-bit?).
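
Roughly like this, perhaps (the 3/4 load-factor threshold is just an example, not a proposal for a specific value):

#include <stddef.h>

typedef struct {
    size_t capacity; /* current bucket count */
    size_t count;    /* entries stored */
} table_stats_t;

/* Pick the next table size: no growth below a 3/4 load factor, then
 * either double or add ~30%, per the two policies mentioned above. */
static size_t
next_capacity(const table_stats_t *t, int aggressive)
{
    if (t->count * 4 < t->capacity * 3)
        return t->capacity;
    return aggressive ? t->capacity * 2
                      : t->capacity + (t->capacity * 3) / 10;
}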

mykaul avatar Sep 14 '22 14:09 mykaul

1 test(s) failed ./tests/bugs/distribute/bug-1286171.t

0 test(s) generated core

4 test(s) needed retry ./tests/basic/afr/ta-shd.t ./tests/basic/ec/ec-1468261.t ./tests/basic/glusterd/thin-arbiter-volume-probe.t ./tests/bugs/distribute/bug-1286171.t https://build.gluster.org/job/gh_centos7-regression/2858/

gluster-ant avatar Sep 14 '22 14:09 gluster-ant

Small comment on the memory size of the table - it would be smaller if we weren't using lists (16 bytes per entry for the prev and next pointers) and used just a single pointer per entry. It should also be faster to go over (theoretically; didn't test yet, of course). BTW, one of the ideas of linear addressing is that the table grows dynamically, in the sense that you start small and either double it or add 30% or so once it reaches some threshold.

Perhaps there are other per-entry memory-consumption optimizations (could gf_atomic_t nlookup and gf_atomic_t kids be 32-bit atomics rather than 64-bit?).

OK, so my 12MB calculation was based on a base table of 512k entries, with 1m entries all in the same chain (extremely unlikely) and a next pointer embedded directly in the stored structure. Adding a prev pointer takes us to 20MB, and 28MB with the gf_atomic_t value at 64 bits. Since on a 64-bit system the compiler is likely to align the data to 64-bit boundaries in any case, that being the more optimal memory-retrieval boundary, I don't think you'll win much by shrinking it to 32 bits, unless you also shrink some other 64-bit fields to 32 bits and can pack those 32-bit fields adjacently so that you actually save multiple values (illustrated below).
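
To illustrate the packing point (hypothetical layouts, not the real inode_t):

#include <stdint.h>

/* Narrowing a single counter to 32 bits just leaves a padding hole on a
 * 64-bit ABI, so nothing is saved. */
struct one_narrow {
    uint32_t nlookup; /* 4 bytes, then 4 bytes of padding */
    void *next;       /* 8-byte alignment forces the hole above */
};                    /* sizeof == 16, same as with a 64-bit nlookup */

/* Only when a second narrow field fills the hole is anything won. */
struct two_narrow {
    uint32_t nlookup; /* 4 bytes */
    uint32_t nkids;   /* packs into what was padding */
    void *next;
};                    /* sizeof == 16, versus 24 with two 64-bit counters */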

Our use-case is (from what I've deduced in this and other discussions) fairly extreme, pathological some would say, so frankly, I don't think any of this is worth the overall discussion.

Just getting rid of the prev pointer would already optimize the linked lists in terms of memory usage (8MB of RAM for our use case), but more importantly, keeping both pointers up to date during addition and removal is crazy. Addition should be a simple prepend, whereas removal from a singly linked list implies we need the previous element for fast removal. Depending on how this works, either we're searching for the entry to be removed by gfid, doing a linear scan of some bucket in any case, in which case the prev pointer is useless; or we have a pointer to the structure to be removed, in which case the prev pointer makes removal constant-time rather than linear-time (because without it we'd have to search through the bucket for the pointer to the entry).
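
For illustration (generic lists, not glusterfs code), this is the trade-off with a singly linked bucket chain: removal by node pointer costs a scan for the predecessor, where a doubly linked node unlinks itself in constant time.

struct snode {
    struct snode *next;
};

/* Remove victim from the chain at *head: the linear scan below is exactly
 * the price of dropping the prev pointer. */
static void
slist_remove(struct snode **head, struct snode *victim)
{
    struct snode **pp = head;
    while (*pp != NULL && *pp != victim)
        pp = &(*pp)->next;  /* walk until pp points at the victim's link */
    if (*pp != NULL)
        *pp = victim->next; /* splice the victim out */
}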

jkroonza avatar Sep 14 '22 14:09 jkroonza

cat: /mnt/glusterfs/0/a.txt: Cannot allocate memory

ec-1468261.t:
ok  31 [   4647/    110] <  71> 'kill_brick patchy 172.31.14.233 /d/backends/patchy0'
[2022-09-14 13:20:15.218992 +0000] E [rpc-clnt.c:320:saved_frames_unwind] (--> /build/install/lib/libglusterfs.so.0(_gf_log_callingfn+0x1bc)[0x7f44ddab4dfc] (--> /build/install/lib/libgfrpc.so.0(+0x11a46)[0x7f44dd415a46] (--> /build/install/lib/libgfrpc.so.0(+0x11b04)[0x7f44dd415b04] (--> /build/install/lib/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x17f)[0x7f44dd4160cd] (--> /build/install/lib/libgfrpc.so.0(+0x12968)[0x7f44dd416968] ))))) 0-gf-attach-rpc: forced unwinding frame type(brick operations) op(--(1)) called at 2022-09-14 13:20:15 +0000 (xid=0x2)
got error -1 on RPC
ok  32 [     14/    113] <  72> 'kill_brick patchy 172.31.14.233 /d/backends/patchy1'
ok  33 [     14/    717] <  73> '4 ec_child_up_count patchy 0'
ok  34 [     64/      7] <  76> 'kill %1'
ok  35 [     23/    199] <  80> 'gluster --mode=script --wignore volume start patchy force'
ok  36 [     15/   2615] <  81> '6 ec_child_up_count patchy 0'
ok  37 [     40/   2645] <  84> '^0$ get_pending_heal_count patchy'
ok  38 [     14/    109] <  87> 'kill_brick patchy 172.31.14.233 /d/backends/patchy3'
ok  39 [     14/     96] <  88> 'kill_brick patchy 172.31.14.233 /d/backends/patchy4'
ok  40 [     14/     36] <  89> '4 ec_child_up_count patchy 0'
ok  41 [     15/     52] <  93> 'rm -rf /mnt/glusterfs/0/*'

thin-arbiter-volume-probe.t:
Unclear what the problem is here.

./tests/bugs/distribute/bug-1286171.t timed out after 200 seconds

How do I know if the change I made caused any of this?

jkroonza avatar Sep 14 '22 14:09 jkroonza

CLANG-FORMAT FAILURE: Before merging the patch, this diff needs to be considered for passing clang-format

index c9e9744d6..c03588651 100644
--- a/libglusterfs/src/inode.c
+++ b/libglusterfs/src/inode.c
@@ -176,7 +176,7 @@ hash_dentry(inode_t *parent, const char *name, int mod)
 static int
 hash_gfid(uuid_t uuid, int mod)
 {
-    return ((int *)uuid)[0] & (mod -1);
+    return ((int *)uuid)[0] & (mod - 1);
 }
 
 static void
@@ -1708,20 +1708,21 @@ inode_table_with_invalidator(uint32_t lru_limit, xlator_t *xl,
         new->dentry_hashsize = dentry_hashsize;
     }
 
-	/* The size of hash table MUST always be power of 2 as required by hash_gfid */
+    /* The size of hash table MUST always be power of 2 as required by hash_gfid
+     */
     if (inode_hashsize == 0) {
         new->inode_hashsize = 65536;
     } else {
-		if (inode_hashsize < 0)
-			inode_hashsize = 1;
-
-		if (inode_hashsize & (inode_hashsize - 1)) {
-			/* not a power of two */
-			inode_hashsize <<= 1;
-			do {
-				inode_hashsize &= inode_hashsize - 1;
-			} while (inode_hashsize & (inode_hashsize - 1));
-		}
+        if (inode_hashsize < 0)
+            inode_hashsize = 1;
+
+        if (inode_hashsize & (inode_hashsize - 1)) {
+            /* not a power of two */
+            inode_hashsize <<= 1;
+            do {
+                inode_hashsize &= inode_hashsize - 1;
+            } while (inode_hashsize & (inode_hashsize - 1));
+        }
 
         new->inode_hashsize = inode_hashsize;
     }

gluster-ant avatar Sep 15 '22 06:09 gluster-ant

CLANG-FORMAT FAILURE: Before merging the patch, this diff needs to be considered for passing clang-format

index dea754a43..c03588651 100644
--- a/libglusterfs/src/inode.c
+++ b/libglusterfs/src/inode.c
@@ -1709,22 +1709,22 @@ inode_table_with_invalidator(uint32_t lru_limit, xlator_t *xl,
     }
 
     /* The size of hash table MUST always be power of 2 as required by hash_gfid
-    */
+     */
     if (inode_hashsize == 0) {
-	new->inode_hashsize = 65536;
+        new->inode_hashsize = 65536;
     } else {
-	if (inode_hashsize < 0)
-	    inode_hashsize = 1;
-
-	if (inode_hashsize & (inode_hashsize - 1)) {
-	    /* not a power of two */
-	    inode_hashsize <<= 1;
-	    do {
-		inode_hashsize &= inode_hashsize - 1;
-	    } while (inode_hashsize & (inode_hashsize - 1));
-	}
-
-	new->inode_hashsize = inode_hashsize;
+        if (inode_hashsize < 0)
+            inode_hashsize = 1;
+
+        if (inode_hashsize & (inode_hashsize - 1)) {
+            /* not a power of two */
+            inode_hashsize <<= 1;
+            do {
+                inode_hashsize &= inode_hashsize - 1;
+            } while (inode_hashsize & (inode_hashsize - 1));
+        }
+
+        new->inode_hashsize = inode_hashsize;
     }
 
     /* In case FUSE is initing the inode table. */

gluster-ant avatar Sep 15 '22 07:09 gluster-ant

CLANG-FORMAT FAILURE: Before merging the patch, this diff needs to be considered for passing clang-format

index c63b934c3..f5e19a89d 100644
--- a/glusterfsd/src/glusterfsd.c
+++ b/glusterfsd/src/glusterfsd.c
@@ -1121,7 +1121,8 @@ parse_opts(int key, char *arg, struct argp_state *state)
             if (!gf_string2int32(arg, &cmd_args->inode_table_size))
                 break;
 
-            argp_failure(state, -1, 0, "unknown inode table size option %s", arg);
+            argp_failure(state, -1, 0, "unknown inode table size option %s",
+                         arg);
             break;
 
         case ARGP_FUSE_INVALIDATE_LIMIT_KEY:
diff --git a/libglusterfs/src/glusterfs/glusterfs.h b/libglusterfs/src/glusterfs/glusterfs.h
index 3fa2e69b8..7d9251bec 100644
--- a/libglusterfs/src/glusterfs/glusterfs.h
+++ b/libglusterfs/src/glusterfs/glusterfs.h
@@ -63,7 +63,6 @@
 #define O_PATH 010000000 /* from asm-generic/fcntl.h */
 #endif
 
-
 #ifndef EBADFD
 /* Mac OS X does not have EBADFD */
 #define EBADFD EBADF
diff --git a/libglusterfs/src/inode.c b/libglusterfs/src/inode.c
index 34b371d3c..00634fb40 100644
--- a/libglusterfs/src/inode.c
+++ b/libglusterfs/src/inode.c
@@ -1715,12 +1715,14 @@ inode_table_with_invalidator(uint32_t lru_limit, xlator_t *xl,
         new->inode_hashsize = 65536;
     } else {
         new->inode_hashsize = inode_hashsize;
-        while ((diff = new->inode_hashsize & -new->inode_hashsize) != new->inode_hashsize)
+        while ((diff = new->inode_hashsize & -new->inode_hashsize) !=
+               new->inode_hashsize)
             new->inode_hashsize += diff;
 
         if (new->inode_hashsize != inode_hashsize)
             gf_log(THIS->name, GF_LOG_WARNING,
-                    "Rounded inode table size up to %lu from %u", new->inode_hashsize, inode_hashsize);
+                   "Rounded inode table size up to %lu from %u",
+                   new->inode_hashsize, inode_hashsize);
     }
 
     /* In case FUSE is initing the inode table. */
diff --git a/xlators/mount/fuse/src/fuse-bridge.c b/xlators/mount/fuse/src/fuse-bridge.c
index 90685c399..462f2ddca 100644
--- a/xlators/mount/fuse/src/fuse-bridge.c
+++ b/xlators/mount/fuse/src/fuse-bridge.c
@@ -6467,8 +6467,9 @@ fuse_graph_setup(xlator_t *this, glusterfs_graph_t *graph)
         }
 
 #if FUSE_KERNEL_MINOR_VERSION >= 11
-        itable = inode_table_with_invalidator(
-            priv->lru_limit, graph->top, fuse_inode_invalidate_fn, this, 0, priv->inode_table_size);
+        itable = inode_table_with_invalidator(priv->lru_limit, graph->top,
+                                              fuse_inode_invalidate_fn, this, 0,
+                                              priv->inode_table_size);
 #else
         itable = inode_table_new(0, graph->top, 0, 0);
 #endif
@@ -6900,7 +6901,8 @@ init(xlator_t *this_xl)
 
     GF_OPTION_INIT("lru-limit", priv->lru_limit, uint32, cleanup_exit);
 
-    GF_OPTION_INIT("inode-table-size", priv->inode_table_size, uint32, cleanup_exit);
+    GF_OPTION_INIT("inode-table-size", priv->inode_table_size, uint32,
+                   cleanup_exit);
 
     GF_OPTION_INIT("invalidate-limit", priv->invalidate_limit, uint32,
                    cleanup_exit);
@@ -7236,7 +7238,8 @@ struct volume_options options[] = {
         .type = GF_OPTION_TYPE_INT,
         .default_value = "65536",
         .min = 0,
-        .description = "sets the size of the inode hash table (must be a power of two, will be rounded up if not)",
+        .description = "sets the size of the inode hash table (must be a power "
+                       "of two, will be rounded up if not)",
     },
     {
         .key = {"invalidate-limit"},

gluster-ant avatar Sep 15 '22 12:09 gluster-ant

CLANG-FORMAT FAILURE: Before merging the patch, this diff needs to be considered for passing clang-format

index c63b934c3..f5e19a89d 100644
--- a/glusterfsd/src/glusterfsd.c
+++ b/glusterfsd/src/glusterfsd.c
@@ -1121,7 +1121,8 @@ parse_opts(int key, char *arg, struct argp_state *state)
             if (!gf_string2int32(arg, &cmd_args->inode_table_size))
                 break;
 
-            argp_failure(state, -1, 0, "unknown inode table size option %s", arg);
+            argp_failure(state, -1, 0, "unknown inode table size option %s",
+                         arg);
             break;
 
         case ARGP_FUSE_INVALIDATE_LIMIT_KEY:
diff --git a/libglusterfs/src/glusterfs/glusterfs.h b/libglusterfs/src/glusterfs/glusterfs.h
index 3fa2e69b8..7d9251bec 100644
--- a/libglusterfs/src/glusterfs/glusterfs.h
+++ b/libglusterfs/src/glusterfs/glusterfs.h
@@ -63,7 +63,6 @@
 #define O_PATH 010000000 /* from asm-generic/fcntl.h */
 #endif
 
-
 #ifndef EBADFD
 /* Mac OS X does not have EBADFD */
 #define EBADFD EBADF
diff --git a/libglusterfs/src/inode.c b/libglusterfs/src/inode.c
index 4c25a8280..3bfd8e305 100644
--- a/libglusterfs/src/inode.c
+++ b/libglusterfs/src/inode.c
@@ -1715,12 +1715,14 @@ inode_table_with_invalidator(uint32_t lru_limit, xlator_t *xl,
         new->inode_hashsize = 65536;
     } else {
         new->inode_hashsize = inode_hashsize;
-        while ((diff = new->inode_hashsize & -new->inode_hashsize) != new->inode_hashsize)
+        while ((diff = new->inode_hashsize & -new->inode_hashsize) !=
+               new->inode_hashsize)
             new->inode_hashsize += diff;
 
         if (new->inode_hashsize != inode_hashsize)
             gf_log(THIS->name, GF_LOG_WARNING,
-                    "Rounded inode table size up to %zu from %u", new->inode_hashsize, inode_hashsize);
+                   "Rounded inode table size up to %zu from %u",
+                   new->inode_hashsize, inode_hashsize);
     }
 
     /* In case FUSE is initing the inode table. */
diff --git a/xlators/mount/fuse/src/fuse-bridge.c b/xlators/mount/fuse/src/fuse-bridge.c
index 90685c399..462f2ddca 100644
--- a/xlators/mount/fuse/src/fuse-bridge.c
+++ b/xlators/mount/fuse/src/fuse-bridge.c
@@ -6467,8 +6467,9 @@ fuse_graph_setup(xlator_t *this, glusterfs_graph_t *graph)
         }
 
 #if FUSE_KERNEL_MINOR_VERSION >= 11
-        itable = inode_table_with_invalidator(
-            priv->lru_limit, graph->top, fuse_inode_invalidate_fn, this, 0, priv->inode_table_size);
+        itable = inode_table_with_invalidator(priv->lru_limit, graph->top,
+                                              fuse_inode_invalidate_fn, this, 0,
+                                              priv->inode_table_size);
 #else
         itable = inode_table_new(0, graph->top, 0, 0);
 #endif
@@ -6900,7 +6901,8 @@ init(xlator_t *this_xl)
 
     GF_OPTION_INIT("lru-limit", priv->lru_limit, uint32, cleanup_exit);
 
-    GF_OPTION_INIT("inode-table-size", priv->inode_table_size, uint32, cleanup_exit);
+    GF_OPTION_INIT("inode-table-size", priv->inode_table_size, uint32,
+                   cleanup_exit);
 
     GF_OPTION_INIT("invalidate-limit", priv->invalidate_limit, uint32,
                    cleanup_exit);
@@ -7236,7 +7238,8 @@ struct volume_options options[] = {
         .type = GF_OPTION_TYPE_INT,
         .default_value = "65536",
         .min = 0,
-        .description = "sets the size of the inode hash table (must be a power of two, will be rounded up if not)",
+        .description = "sets the size of the inode hash table (must be a power "
+                       "of two, will be rounded up if not)",
     },
     {
         .key = {"invalidate-limit"},

gluster-ant avatar Sep 15 '22 15:09 gluster-ant

Right, this is in line with what I want to do; I hope this clarifies the short-term goal. I do agree that this table can be improved overall in various ways, as discussed. However, if I'm right about the crashes we're seeing, this should at a minimum alleviate the problem so that it occurs less frequently, whilst probably also improving performance on the FUSE side at the very least.

The bricks can probably also benefit from this, but I've no idea where to add the required config options at this stage, so I would request that this be considered as is. If in your opinion this would be beneficial for bricks too, I can try to expose config options via gluster volume set.
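
For reference, the rounding applied in the patch is the lowest-set-bit trick from the diff above; a standalone sketch of the behaviour (unsigned arithmetic here, to keep the negation well-defined):

#include <stdio.h>

/* Round x up to the next power of two by repeatedly adding its lowest set
 * bit (x & -x); the carries ripple upward until a single bit remains. */
static unsigned long
round_up_pow2(unsigned long x)
{
    unsigned long low;
    while ((low = x & -x) != x)
        x += low;
    return x;
}

int
main(void)
{
    printf("%lu\n", round_up_pow2(65537));  /* 131072 */
    printf("%lu\n", round_up_pow2(524288)); /* 524288, already 2^19 */
    return 0;
}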

jkroonza avatar Sep 16 '22 04:09 jkroonza

CLANG-FORMAT FAILURE: Before merging the patch, this diff needs to be considered for passing clang-format

index 3fa2e69b8..7d9251bec 100644
--- a/libglusterfs/src/glusterfs/glusterfs.h
+++ b/libglusterfs/src/glusterfs/glusterfs.h
@@ -63,7 +63,6 @@
 #define O_PATH 010000000 /* from asm-generic/fcntl.h */
 #endif
 
-
 #ifndef EBADFD
 /* Mac OS X does not have EBADFD */
 #define EBADFD EBADF
diff --git a/libglusterfs/src/inode.c b/libglusterfs/src/inode.c
index dc069c077..ef84e9579 100644
--- a/libglusterfs/src/inode.c
+++ b/libglusterfs/src/inode.c
@@ -1715,7 +1715,8 @@ inode_table_with_invalidator(uint32_t lru_limit, xlator_t *xl,
         if (inode_hashsize > 0)
             gf_log(THIS->name, GF_LOG_WARNING,
                    "Set inode table size to minimum value of 65536 rather than "
-                   "the configured %u", inode_hashsize);
+                   "the configured %u",
+                   inode_hashsize);
         new->inode_hashsize = 65536;
     } else {
         new->inode_hashsize = inode_hashsize;

gluster-ant avatar Sep 19 '22 10:09 gluster-ant

CLANG-FORMAT FAILURE: Before merging the patch, this diff needs to be considered for passing clang-format

index 3fa2e69b8..7d9251bec 100644
--- a/libglusterfs/src/glusterfs/glusterfs.h
+++ b/libglusterfs/src/glusterfs/glusterfs.h
@@ -63,7 +63,6 @@
 #define O_PATH 010000000 /* from asm-generic/fcntl.h */
 #endif
 
-
 #ifndef EBADFD
 /* Mac OS X does not have EBADFD */
 #define EBADFD EBADF
diff --git a/libglusterfs/src/inode.c b/libglusterfs/src/inode.c
index dc069c077..ef84e9579 100644
--- a/libglusterfs/src/inode.c
+++ b/libglusterfs/src/inode.c
@@ -1715,7 +1715,8 @@ inode_table_with_invalidator(uint32_t lru_limit, xlator_t *xl,
         if (inode_hashsize > 0)
             gf_log(THIS->name, GF_LOG_WARNING,
                    "Set inode table size to minimum value of 65536 rather than "
-                   "the configured %u", inode_hashsize);
+                   "the configured %u",
+                   inode_hashsize);
         new->inode_hashsize = 65536;
     } else {
         new->inode_hashsize = inode_hashsize;

I didn't introduce the extra newline, but can certainly drop it here.

Splitting the other line really does not (in my opinion) improve readability either, but I'm happy to comply with clang-format here if need be.

FYI, simply running ./rfc.sh does all the basic checks, but on my configuration it fails out before doing so (I set up my repo for push and the main repo for pull, resulting in the script getting very, very confused). It's a bit backwards, but it works for me.

jkroonza avatar Sep 19 '22 10:09 jkroonza

Splitting the other line really does not (in my opinion) improve readability either, but I'm happy to comply with clang-format here if need be.

It's not necessary. We use clang-format for some things that we decided are important (line limits, curly brace location, ...). But clang-format also formats other things that we don't care about or don't have a strong opinion on. In those cases, we ignore the "recommendation".

xhernandez avatar Sep 19 '22 11:09 xhernandez

/run regression

xhernandez avatar Sep 21 '22 09:09 xhernandez

1 test(s) failed ./tests/bugs/read-only/bug-1134822-read-only-default-in-graph.t

0 test(s) generated core

5 test(s) needed retry ./tests/000-flaky/bugs_core_multiplex-limit-issue-151.t ./tests/000-flaky/glusterd-restart-shd-mux.t ./tests/basic/afr/metadata-self-heal.t ./tests/bugs/glusterd/brick-order-check-add-brick.t ./tests/bugs/read-only/bug-1134822-read-only-default-in-graph.t https://build.gluster.org/job/gh_centos7-regression/2895/

gluster-ant avatar Sep 21 '22 13:09 gluster-ant

1 test(s) failed ./tests/bugs/read-only/bug-1134822-read-only-default-in-graph.t

0 test(s) generated core

5 test(s) needed retry ./tests/000-flaky/bugs_core_multiplex-limit-issue-151.t ./tests/000-flaky/glusterd-restart-shd-mux.t ./tests/basic/afr/metadata-self-heal.t ./tests/bugs/glusterd/brick-order-check-add-brick.t ./tests/bugs/read-only/bug-1134822-read-only-default-in-graph.t https://build.gluster.org/job/gh_centos7-regression/2895/

15:04:43 not ok 25 [ 14/ 3] < 65> '! rm -rf /mnt/glusterfs/0/zerobytefile2.txt' -> ''

read-only off, worm on

Does worm preclude rm? Or is it supposed to? I don't see how anything I did could influence this.

Anything else that needs handling?

jkroonza avatar Sep 22 '22 15:09 jkroonza

/run regression

xhernandez avatar Sep 23 '22 08:09 xhernandez

15:04:43 not ok 25 [ 14/ 3] < 65> '! rm -rf /mnt/glusterfs/0/zerobytefile2.txt' -> ''

read-only off, worm on

Does worm preclude rm? Or is it supposed to? I don't see how anything I did could influence this.

Sometimes we get spurious failures. I've just triggered the regression again. Let's see if the problem persists.

xhernandez avatar Sep 23 '22 08:09 xhernandez

1 test(s) failed ./tests/bugs/replicate/bug-1250170-fsync.t

0 test(s) generated core

4 test(s) needed retry ./tests/000-flaky/basic_afr_split-brain-favorite-child-policy.t ./tests/000-flaky/glusterd-restart-shd-mux.t ./tests/bugs/glusterd/brick-order-check-add-brick.t ./tests/bugs/replicate/bug-1250170-fsync.t https://build.gluster.org/job/gh_centos7-regression/2899/

gluster-ant avatar Sep 23 '22 11:09 gluster-ant

It seems there's always one :). Do we know what causes these sporadic (seemingly random) failures?

jkroonza avatar Sep 23 '22 12:09 jkroonza