Replicated volume with > 16 bricks problem
Description of problem: creating a GlusterFS replicated volume and mounting a client to it works fine with <= 16 replica nodes. The problem is that when creating a replicated volume with more than 16 bricks (e.g. Number of Bricks: 1 x 17 = 17), the volume cannot be mounted. The same problem occurs when adding one brick to a 16-node replicated volume: the volume cannot be extended, or it can be extended only if it is a fresh volume with no mounts to it.
The exact command to reproduce the issue: first case: after the peer probes, create and start the volume with:
gluster volume create gv3 replica 17 gluster1:/data3 gluster2:/data3 gluster3:/data3 gluster4:/data3 gluster5:/data3 gluster6:/data3 gluster7:/data3 gluster8:/data3 gluster9:/data3 gluster10:/data3 gluster11:/data3 gluster12:/data3 gluster13:/data3 gluster14:/data3 gluster15:/data3 gluster16:/data3 gluster17:/data3 force
gluster volume start gv3
The command that does not execute correctly is:
mount -t glusterfs gluster1:/gv3 /mnt
The second case is adding one brick to a 16-node replicated volume:
gluster volume add-brick gv3 replica 17 gluster17:/data3 force
The full output of the command that failed:
Output from the mount command:
Mount failed. Check the log file for more details.
Output from /var/log/glusterfs/mnt.log is attached in Details.
The second case gives:
volume add-brick: failed: Commit failed on gluster17. Please check log file for details.
Output from glusterd.log is attached in Details.
Expected results: the mnt.log output from mounting a volume with <= 16 replica nodes:
[2023-02-09 15:53:35.039980 +0000] I [fuse-bridge.c:5294:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.32
[2023-02-09 15:53:35.039997 +0000] I [fuse-bridge.c:5926:fuse_graph_sync] 0-fuse: switched to graph 0
[2023-02-09 15:53:35.042394 +0000] I [MSGID: 108031] [afr-common.c:3201:afr_local_discovery_cbk] 0-gv3-replicate-0: selecting local read_child gv3-client-1
Mandatory info:
- The output of the gluster volume info command:
Volume Name: gv3
Type: Replicate
Volume ID: 97c058ca-e019-4332-b3cc-4c2848dc8691
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 17 = 17
Transport-type: tcp
Bricks:
Brick1: gluster1:/data3
Brick2: gluster2:/data3
Brick3: gluster3:/data3
Brick4: gluster4:/data3
Brick5: gluster5:/data3
Brick6: gluster6:/data3
Brick7: gluster7:/data3
Brick8: gluster8:/data3
Brick9: gluster9:/data3
Brick10: gluster10:/data3
Brick11: gluster11:/data3
Brick12: gluster12:/data3
Brick13: gluster13:/data3
Brick14: gluster14:/data3
Brick15: gluster15:/data3
Brick16: gluster16:/data3
Brick17: gluster17:/data3
Options Reconfigured:
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
- The output of the gluster volume status command:
Status of volume: gv3
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick gluster1:/data3 60324 0 Y 441
Brick gluster2:/data3 53482 0 Y 189
Brick gluster3:/data3 58732 0 Y 176
Brick gluster4:/data3 52387 0 Y 177
Brick gluster5:/data3 55738 0 Y 178
Brick gluster6:/data3 54841 0 Y 176
Brick gluster7:/data3 53311 0 Y 177
Brick gluster8:/data3 56091 0 Y 178
Brick gluster9:/data3 57727 0 Y 177
Brick gluster10:/data3 52444 0 Y 177
Brick gluster11:/data3 53413 0 Y 177
Brick gluster12:/data3 54642 0 Y 177
Brick gluster13:/data3 50610 0 Y 178
Brick gluster14:/data3 51263 0 Y 177
Brick gluster15:/data3 59344 0 Y 178
Brick gluster16:/data3 49325 0 Y 178
Brick gluster17:/data3 58634 0 Y 177
Self-heal Daemon on localhost N/A N/A Y 206
Self-heal Daemon on gluster3 N/A N/A Y 193
Self-heal Daemon on gluster4 N/A N/A Y 194
Self-heal Daemon on gluster6 N/A N/A Y 193
Self-heal Daemon on gluster5 N/A N/A Y 195
Self-heal Daemon on gluster7 N/A N/A Y 194
Self-heal Daemon on gluster10 N/A N/A Y 194
Self-heal Daemon on gluster9 N/A N/A Y 194
Self-heal Daemon on gluster11 N/A N/A Y 194
Self-heal Daemon on gluster8 N/A N/A Y 195
Self-heal Daemon on gluster12 N/A N/A Y 194
Self-heal Daemon on gluster13 N/A N/A Y 195
Self-heal Daemon on gluster15 N/A N/A Y 195
Self-heal Daemon on gluster16 N/A N/A Y 195
Self-heal Daemon on gluster14 N/A N/A Y 194
Self-heal Daemon on gluster19 N/A N/A Y 178
Self-heal Daemon on gluster1.gluster-try N/A N/A Y 458
Self-heal Daemon on gluster18 N/A N/A Y 179
Self-heal Daemon on gluster17 N/A N/A Y 194
Self-heal Daemon on gluster20 N/A N/A Y 177
Task Status of Volume gv3
------------------------------------------------------------------------------
There are no active volume tasks
- Provide logs present on following locations of client and server nodes - /var/log/glusterfs/:
- mnt.log for the volume mount case: https://pastebin.com/2yYseJ2X
- glusterd.log from the add-brick case: https://pastebin.com/iaJWNHLS
- The operating system / glusterfs version:
glusterfs 10.3
Linux 5.10.124-linuxkit #1 SMP PREEMPT aarch64 aarch64 aarch64 GNU/Linux
This probably has to do with this code:
static int
__afr_inode_read_subvol_get(inode_t *inode, xlator_t *this, unsigned char *data,
                            unsigned char *metadata, int *event_p)
{
    afr_private_t *priv = NULL;
    int ret = -1;

    priv = this->private;

    if (priv->child_count <= 16)
        ret = __afr_inode_read_subvol_get_small(inode, this, data, metadata,
                                                event_p);
    else
        /* TBD: allocate structure with array and read from it */
        ret = -1;

    return ret;
}
But I'm not sure what the use case is here.
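If I read the "small" variant correctly, the per-brick data/metadata readable flags and the event generation are packed into a single 64-bit value in the inode context, with 16 bits for each map, so there is simply no bit available for a 17th brick and the >16 case falls into the unimplemented branch above. A rough standalone sketch of that packing idea (illustrative only, not the actual GlusterFS code; all names here are made up):

/*
 * Illustrative sketch only: per-brick data and metadata "readable" flags are
 * packed into 16-bit fields of a single 64-bit word, so a 17th brick has no
 * bit to live in and the lookup has to bail out.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_SMALL_CHILDREN 16 /* one bit per brick in each 16-bit map */

/* Pack: metadata map in bits 0-15, data map in bits 16-31, event in 32-63. */
static uint64_t
pack_read_subvol(uint16_t datamap, uint16_t metadatamap, uint32_t event)
{
    return ((uint64_t)event << 32) | ((uint64_t)datamap << 16) | metadatamap;
}

/* Unpack into per-child arrays, mirroring the "<= 16 children" fast path. */
static int
unpack_read_subvol(uint64_t val, int child_count, unsigned char *data,
                   unsigned char *metadata, uint32_t *event)
{
    if (child_count > MAX_SMALL_CHILDREN)
        return -1; /* same unimplemented case as the else branch above */

    uint16_t metadatamap = val & 0xffff;
    uint16_t datamap = (val >> 16) & 0xffff;
    *event = val >> 32;

    for (int i = 0; i < child_count; i++) {
        data[i] = (datamap >> i) & 1;
        metadata[i] = (metadatamap >> i) & 1;
    }
    return 0;
}

int
main(void)
{
    unsigned char data[16] = {0}, metadata[16] = {0};
    uint32_t event = 0;
    uint64_t val = pack_read_subvol(0xffff, 0xffff, 1);

    printf("16 children -> %d\n",
           unpack_read_subvol(val, 16, data, metadata, &event));
    printf("17 children -> %d\n",
           unpack_read_subvol(val, 17, data, metadata, &event));
    return 0;
}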
Oh, that may be it.
If you're asking about my use case: we want to have N servers which all serve the same files. We use GlusterFS to easily monitor/replicate/manage the state of these files between the servers, and since performance is a huge factor, we need to store these files locally (i.e. we don't use gluster clients on these N servers or any kind of distributed volume, so we're able to serve all the files as quickly as possible; we manage them through clients located on other machines). We needed to scale beyond 16 servers, and that's when we encountered this issue.
I've found similar issue reported here https://www.mail-archive.com/[email protected]/msg35236.html
@pborowskiCT if I understand what you say correctly, you are directly reading the data from the brick on each client instead of going through a regular Gluster mount point, right?
This usage is unsupported. First of all, it bypasses all the integrity and consistency checks that Gluster does, so it may happen that each client sees different data inside the same file. The data may be stale or corrupted, and there's no way to know that when directly accessing the brick contents.
Also, a replication factor of 16 is a huge overhead in terms of space and performance. Even though Gluster accepts configurations of up to 16 replicas, they are not tested at all, so unexpected behaviours could happen, especially while healing.
@xhernandez thanks for replying. You understood correctly, but we assumed that if Gluster heals correctly and heal status returns no error, then we are guaranteed to have healthy copies of all files on every server. We were aware that it circumvents the approved way of accessing the files, but we have had no problems with this setup so far and it has worked perfectly for our use case.
But besides our "atypical" way of accessing the files, the issue persists. Maybe a >16-replica setup is too uncommon to work on, but if that's the case, I think there should be some mechanism to block users from trying to achieve it (see the sketch below), or at least a mention of the issue in the docs (and maybe an overall advice to avoid multi-node replicated volumes, since you wrote they are not tested configurations)?
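To illustrate what I mean by "some mechanism to block users": even just a guard at volume create / add-brick time that rejects replica counts AFR cannot currently track would have saved us some debugging. This is only a hypothetical sketch; validate_replica_count and AFR_MAX_REPLICA_COUNT are made-up names, not existing GlusterFS symbols:

/* Hypothetical sketch only: reject unsupported replica counts up front
 * instead of failing later at mount or add-brick time. The names below
 * (validate_replica_count, AFR_MAX_REPLICA_COUNT) are illustrative. */
#include <stdio.h>

#define AFR_MAX_REPLICA_COUNT 16 /* ceiling implied by the 16-bit read-subvol maps */

static int
validate_replica_count(int replica_count, char *err, size_t err_len)
{
    if (replica_count > AFR_MAX_REPLICA_COUNT) {
        snprintf(err, err_len,
                 "replica count %d exceeds the supported maximum of %d",
                 replica_count, AFR_MAX_REPLICA_COUNT);
        return -1;
    }
    return 0;
}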
Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.