Replicated volume with > 16 bricks problem
Description of problem: creating a GlusterFS replicated volume and mounting a client to it works fine with <= 16 replica nodes. The problem is that when creating a replicated volume with more than 16 bricks (e.g. Number of Bricks: 1 x 17 = 17), the volume cannot be mounted. The same problem occurs when adding one brick to a 16-node replicated volume: the volume cannot be extended, or it can be extended only if it is a fresh volume with no mounts to it.
The exact command to reproduce the issue: first case: after the peer probes, create and start the volume with:
gluster volume create gv3 replica 17 gluster1:/data3 gluster2:/data3 gluster3:/data3 gluster4:/data3 gluster5:/data3 gluster6:/data3 gluster7:/data3 gluster8:/data3 gluster9:/data3 gluster10:/data3 gluster11:/data3 gluster12:/data3 gluster13:/data3 gluster14:/data3 gluster15:/data3 gluster16:/data3 gluster17:/data3 force
gluster volume start gv3
The command that does not execute correctly is:
mount -t glusterfs gluster1:/gv3 /mnt
The second case is adding one brick to a 16-node replicated volume:
gluster volume add-brick gv3 replica 17 gluster17:/data3 force
The full output of the command that failed:
Output from the mount command:
Mount failed. Check the log file for more details.
Output from /var/log/glusterfs/mnt.log is attached in Details.
The second case gives:
volume add-brick: failed: Commit failed on gluster17. Please check log file for details.
Output from glusterd.log is attached in Details.
Expected results: the mnt.log output from mounting a volume with <= 16 replica nodes:
[2023-02-09 15:53:35.039980 +0000] I [fuse-bridge.c:5294:fuse_init] 0-glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.24 kernel 7.32
[2023-02-09 15:53:35.039997 +0000] I [fuse-bridge.c:5926:fuse_graph_sync] 0-fuse: switched to graph 0
[2023-02-09 15:53:35.042394 +0000] I [MSGID: 108031] [afr-common.c:3201:afr_local_discovery_cbk] 0-gv3-replicate-0: selecting local read_child gv3-client-1
Mandatory info:
- The output of the gluster volume info command:
Volume Name: gv3
Type: Replicate
Volume ID: 97c058ca-e019-4332-b3cc-4c2848dc8691
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 17 = 17
Transport-type: tcp
Bricks:
Brick1: gluster1:/data3
Brick2: gluster2:/data3
Brick3: gluster3:/data3
Brick4: gluster4:/data3
Brick5: gluster5:/data3
Brick6: gluster6:/data3
Brick7: gluster7:/data3
Brick8: gluster8:/data3
Brick9: gluster9:/data3
Brick10: gluster10:/data3
Brick11: gluster11:/data3
Brick12: gluster12:/data3
Brick13: gluster13:/data3
Brick14: gluster14:/data3
Brick15: gluster15:/data3
Brick16: gluster16:/data3
Brick17: gluster17:/data3
Options Reconfigured:
cluster.granular-entry-heal: on
storage.fips-mode-rchecksum: on
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
- The output of the gluster volume status command:
Status of volume: gv3
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick gluster1:/data3 60324 0 Y 441
Brick gluster2:/data3 53482 0 Y 189
Brick gluster3:/data3 58732 0 Y 176
Brick gluster4:/data3 52387 0 Y 177
Brick gluster5:/data3 55738 0 Y 178
Brick gluster6:/data3 54841 0 Y 176
Brick gluster7:/data3 53311 0 Y 177
Brick gluster8:/data3 56091 0 Y 178
Brick gluster9:/data3 57727 0 Y 177
Brick gluster10:/data3 52444 0 Y 177
Brick gluster11:/data3 53413 0 Y 177
Brick gluster12:/data3 54642 0 Y 177
Brick gluster13:/data3 50610 0 Y 178
Brick gluster14:/data3 51263 0 Y 177
Brick gluster15:/data3 59344 0 Y 178
Brick gluster16:/data3 49325 0 Y 178
Brick gluster17:/data3 58634 0 Y 177
Self-heal Daemon on localhost N/A N/A Y 206
Self-heal Daemon on gluster3 N/A N/A Y 193
Self-heal Daemon on gluster4 N/A N/A Y 194
Self-heal Daemon on gluster6 N/A N/A Y 193
Self-heal Daemon on gluster5 N/A N/A Y 195
Self-heal Daemon on gluster7 N/A N/A Y 194
Self-heal Daemon on gluster10 N/A N/A Y 194
Self-heal Daemon on gluster9 N/A N/A Y 194
Self-heal Daemon on gluster11 N/A N/A Y 194
Self-heal Daemon on gluster8 N/A N/A Y 195
Self-heal Daemon on gluster12 N/A N/A Y 194
Self-heal Daemon on gluster13 N/A N/A Y 195
Self-heal Daemon on gluster15 N/A N/A Y 195
Self-heal Daemon on gluster16 N/A N/A Y 195
Self-heal Daemon on gluster14 N/A N/A Y 194
Self-heal Daemon on gluster19 N/A N/A Y 178
Self-heal Daemon on gluster1.gluster-try N/A N/A Y 458
Self-heal Daemon on gluster18 N/A N/A Y 179
Self-heal Daemon on gluster17 N/A N/A Y 194
Self-heal Daemon on gluster20 N/A N/A Y 177
Task Status of Volume gv3
------------------------------------------------------------------------------
There are no active volume tasks
- Provide logs present on following locations of client and server nodes - /var/log/glusterfs/:
- mnt.log for the volume mount case: https://pastebin.com/2yYseJ2X
- glusterd.log from the add-brick case: https://pastebin.com/iaJWNHLS
- The operating system / glusterfs version:
glusterfs 10.3
Linux 5.10.124-linuxkit #1 SMP PREEMPT aarch64 aarch64 aarch64 GNU/Linux
This probably has to do with this code:
static int
__afr_inode_read_subvol_get(inode_t *inode, xlator_t *this, unsigned char *data,
                            unsigned char *metadata, int *event_p)
{
    afr_private_t *priv = NULL;
    int ret = -1;

    priv = this->private;

    if (priv->child_count <= 16)
        ret = __afr_inode_read_subvol_get_small(inode, this, data, metadata,
                                                event_p);
    else
        /* TBD: allocate structure with array and read from it */
        ret = -1;

    return ret;
}
But I'm not sure what the use case is here.
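If I read the "small" variant correctly, the per-brick data/metadata readable flags and the event generation are packed into a single 64-bit value in the inode context, with 16 bits for each map, so there is simply no bit available for a 17th brick and the >16 case falls into the unimplemented branch above. A rough standalone sketch of that packing idea (illustrative only, not the actual GlusterFS code; all names here are made up):

/*
 * Illustrative sketch only: per-brick data and metadata "readable" flags are
 * packed into 16-bit fields of a single 64-bit word, so a 17th brick has no
 * bit to live in and the lookup has to bail out.
 */
#include <stdint.h>
#include <stdio.h>

#define MAX_SMALL_CHILDREN 16 /* one bit per brick in each 16-bit map */

/* Pack: metadata map in bits 0-15, data map in bits 16-31, event in 32-63. */
static uint64_t
pack_read_subvol(uint16_t datamap, uint16_t metadatamap, uint32_t event)
{
    return ((uint64_t)event << 32) | ((uint64_t)datamap << 16) | metadatamap;
}

/* Unpack into per-child arrays, mirroring the "<= 16 children" fast path. */
static int
unpack_read_subvol(uint64_t val, int child_count, unsigned char *data,
                   unsigned char *metadata, uint32_t *event)
{
    if (child_count > MAX_SMALL_CHILDREN)
        return -1; /* same unimplemented case as the else branch above */

    uint16_t metadatamap = val & 0xffff;
    uint16_t datamap = (val >> 16) & 0xffff;
    *event = val >> 32;

    for (int i = 0; i < child_count; i++) {
        data[i] = (datamap >> i) & 1;
        metadata[i] = (metadatamap >> i) & 1;
    }
    return 0;
}

int
main(void)
{
    unsigned char data[16] = {0}, metadata[16] = {0};
    uint32_t event = 0;
    uint64_t val = pack_read_subvol(0xffff, 0xffff, 1);

    printf("16 children -> %d\n",
           unpack_read_subvol(val, 16, data, metadata, &event));
    printf("17 children -> %d\n",
           unpack_read_subvol(val, 17, data, metadata, &event));
    return 0;
}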
Oh, that may be it.
If you're asking about my use case: we want to have N servers which all serve the same files. We use GlusterFS to easily monitor/replicate/manage the state of these files between the servers, and since performance is a huge factor, we need to store these files locally (i.e. we don't use gluster clients on these N servers or any kind of distributed volume, so we're able to serve all the files as quickly as possible; we manage them through clients located on other machines). We needed to scale beyond 16 servers, and that's when we encountered this issue.
I've found similar issue reported here https://www.mail-archive.com/[email protected]/msg35236.html
@pborowskiCT if I understand what you say correctly, you are directly reading the data from the brick on each client instead of going through a regular Gluster mount point, right?
This usage is unsupported. First of all, it bypasses all the integrity and consistency checks that Gluster does, so it may happen that each client sees different data inside the same file. The data may be stale or corrupted, and there's no way to know that when directly accessing the brick contents.
Also, a replication factor of 16 is a huge overhead in terms of space and performance. Even though Gluster accepts configurations of up to 16 replicas, they are not tested at all, so unexpected behaviours could happen, especially while healing.
@xhernandez thanks for replying. You understood correctly, but we assumed that if Gluster heals correctly and heal status returns no error, then we are guaranteed to have healthy copies of all files on every server. We were aware that it circumvents the approved way of accessing the files, but we have had no problems with this setup so far and it has worked perfectly for our use case.
But besides our "atypical" way of accessing the files, the issue persists. Maybe a >16-replica setup is too uncommon to work on, but if that's the case, I think there should be some mechanism to block users from trying to achieve it (see the sketch below), or at least a mention of the issue in the docs (and maybe an overall advice to avoid multi-node replicated volumes, since you wrote they are not tested configurations)?
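To illustrate what I mean by "some mechanism to block users": even just a guard at volume create / add-brick time that rejects replica counts AFR cannot currently track would have saved us some debugging. This is only a hypothetical sketch; validate_replica_count and AFR_MAX_REPLICA_COUNT are made-up names, not existing GlusterFS symbols:

/* Hypothetical sketch only: reject unsupported replica counts up front
 * instead of failing later at mount or add-brick time. The names below
 * (validate_replica_count, AFR_MAX_REPLICA_COUNT) are illustrative. */
#include <stdio.h>

#define AFR_MAX_REPLICA_COUNT 16 /* ceiling implied by the 16-bit read-subvol maps */

static int
validate_replica_count(int replica_count, char *err, size_t err_len)
{
    if (replica_count > AFR_MAX_REPLICA_COUNT) {
        snprintf(err, err_len,
                 "replica count %d exceeds the supported maximum of %d",
                 replica_count, AFR_MAX_REPLICA_COUNT);
        return -1;
    }
    return 0;
}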
Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.