Kadalu-GlusterFS heal takes a long time or gets stuck after removing a node and adding a new node
In our Kubernetes cluster we have a KadaluStorage volume deployed with Replica3. The spec is as follows:

```yaml
apiVersion: kadalu-operator.storage/v1alpha1
kind: KadaluStorage
metadata:
  name: cvol
  namespace: ns
spec:
  pvReclaimPolicy: retain
  single_pv_per_pool: false
  storage:
    - node: node-1
      path: /path
    - node: node-2
      path: /path
    - node: node-3
      path: /path
  type: Replica3
```
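For context, a minimal sketch of applying and inspecting this spec, assuming the manifest is saved as `cvol.yaml` and that the operator creates the server (brick) pods in the `kadalu` namespace (both assumptions; adjust to your deployment):

```sh
# Apply the KadaluStorage spec (the file name cvol.yaml is an assumption).
kubectl apply -f cvol.yaml

# The operator creates one server (brick) pod per storage entry; by default these
# run in the kadalu namespace (assumption; adjust if deployed differently).
kubectl get pods -n kadalu

# Check the storage resource itself (assumes the CRD plural name is "kadalustorages").
kubectl get kadalustorages -n ns
```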
- As part of a resiliency test, we removed one node and added a different node with no data on it. We executed this scenario in multiple clusters. After the node is added back, most of the time the data is healed to the newly added node without any issues.
- But intermittently we see an issue where the data stays pending heal for a long time (the data size is barely in KBs) or is stuck indefinitely.
When the issue happens, the following are our findings (a sketch of the command used to collect the xattr dumps appears after them):
- We see different GFIDs for the same file on different bricks, but the entry is not reported as SPLIT-BRAIN; it only shows up as HealPending.
Node-1:
BrickPath: file `server_metadata.json` exists; following are its xattrs:
```
# file: server_metadata.json
trusted.afr.common-storage-pool-client-1=0x0000073b0000073b00000000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x0edf887de4444c1e98b74085b1d89e72
trusted.gfid2path.3ebeedcf20fb201b=0x37356339653135342d333361352d343864372d613830362d6437333731333439656236662f7365727665725f6d657461646174612e6a736f6e
trusted.glusterfs.mdata=0x0100000000000000000000000068946929000000000deeb0590000000068946929000000000d9684c90000000068946929000000000d775ead
```
Node-2:
BrickPath: file `server_metadata.json` exists; following are its xattrs:
```
# file: server_metadata.json
trusted.afr.common-storage-pool-client-1=0x000008390000083900000000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0x2b26141413f549758d4eaee483fe506f
trusted.gfid2path.3ebeedcf20fb201b=0x37356339653135342d333361352d343864372d613830362d6437333731333439656236662f7365727665725f6d657461646174612e6a736f6e
trusted.glusterfs.mdata=0x01000000000000000000000000689470ec000000001db6a36f00000000689470ec000000001d54634c00000000689470ec000000001d2a0942
```
Node-3 (newly added node):
BrickPath: since this is a newly added node, `server_metadata.json` is not present (as expected).
FUSE mount: accessing the same file from the FUSE mount gives the following error:
```
ls: cannot access 'server_metadata.json': Input/output error
total 1
-????????? ? ? ? ? ? server_metadata.json
```
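For reference, a minimal sketch of how xattr dumps like the ones above can be collected on each brick, assuming a hypothetical brick path of `/bricks/cvol/data/brick` (the actual path comes from the `storage` entries of your deployment, or can be reached from inside the corresponding Kadalu server pod):

```sh
# Hypothetical brick path; replace with the actual brick directory on each node
# (or run inside the corresponding Kadalu server pod).
BRICK=/bricks/cvol/data/brick

# Dump all extended attributes of the file in hex, as shown in the findings above.
getfattr -d -m . -e hex "$BRICK/server_metadata.json"

# Compare just the GFID across the replicas; differing values for the same path
# indicate the divergence described above.
getfattr -n trusted.gfid -e hex "$BRICK/server_metadata.json"
```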
- The file is present on only one node (a creation/deletion FOP race could have created this scenario), and the same file is not visible from the FUSE client mount. The following steps can help recreate the issue (a rough script sketch follows this list):
  - Keep only two server (brick) pods running.
  - Induce a random delay on one of the server pods.
  - Start file creation from the FUSE client.
  - Restart the server pod that has no network delay, and the client where the file creation is running.
- If we run the above steps in a loop, we can create a scenario where one brick has the file and the other does not.
- Now bring the new node into the system (it will have no data).
- Now we can see that the file created above is stuck in heal. NOTE: the heal is not always stuck; sometimes it recovers automatically.
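A rough, untested sketch of one iteration of the repro loop described above, assuming the server pods run in the `kadalu` namespace, hypothetical pod names, a FUSE mount at `/mnt/cvol`, and that `tc` (with NET_ADMIN) is available inside the server pod:

```sh
#!/bin/bash
# Sketch of the repro described above. All names and paths are assumptions;
# adjust to your deployment. Precondition: only two of the three server (brick)
# pods are running.
NS=kadalu                       # namespace of the Kadalu server pods (assumption)
DELAYED_POD=server-cvol-0-0     # hypothetical name of the pod that gets the delay
HEALTHY_POD=server-cvol-1-0     # hypothetical name of the pod with no delay
MOUNT=/mnt/cvol                 # FUSE mount of the volume on the client

# 1. Induce a random network delay on one of the running server pods.
kubectl exec -n "$NS" "$DELAYED_POD" -- tc qdisc add dev eth0 root netem delay 200ms 100ms

# 2. Start file creation from the FUSE client.
for i in $(seq 1 100); do
  dd if=/dev/urandom of="$MOUNT/file_$i.json" bs=1k count=4 conv=fsync &
done

# 3. Restart the server pod that has no network delay (and, per the steps above,
#    the client where the file creation is running).
kubectl delete pod -n "$NS" "$HEALTHY_POD"
wait

# 4. Remove the delay and look for files that exist on only one brick or show
#    mismatching trusted.gfid values (see the getfattr sketch earlier).
kubectl exec -n "$NS" "$DELAYED_POD" -- tc qdisc del dev eth0 root
```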
Questions for Upstream:
1. Can you help us understand why this behavior (divergent GFIDs for the same path, not reported as split-brain) is happening? What are the underlying scenarios in GlusterFS that can lead to this state?
2. What are the recommended strategies to prevent this specific type of inconsistency from occurring in a Replica3 setup, especially during node replacement and transient network issues?
3. What are the proper methods for recovering from this issue when files are stuck in HealPending due to divergent GFIDs?
@mohammaddawoodshaik Do you have a rename workload? If yes, could you also provide the output of `getfattr -d -m . -e hex <parent-dir-of-server_metadata.json>`?
@pranithk - No, we don't have any rename workload. Our workload mainly involves burst writes of data that are later deleted; this happens in a loop quite frequently.
@mohammaddawoodshaik
> @pranithk - No, we don't have any rename workload. Our workload mainly involves burst writes of data that are later deleted; this happens in a loop quite frequently.

Could you provide the output of `getfattr -d -m . -e hex <parent-dir-of-server_metadata.json>` when you see this issue again? A GFID split-brain is a split-brain of the parent directory.
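A sketch of the requested command, assuming a hypothetical brick path of `/bricks/cvol/data/brick`; `<parent-dir-of-server_metadata.json>` stays a placeholder for the directory that contains the file, and the command should be run against the brick copy on each replica:

```sh
# Run on each replica node (or inside each Kadalu server pod) against the brick
# copy of the parent directory, not via the FUSE mount. Brick path is an assumption.
getfattr -d -m . -e hex /bricks/cvol/data/brick/<parent-dir-of-server_metadata.json>
```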