
dht_revalidate_cbk() needs to trigger directory heal with root permissions and negative pid

Open itisravi opened this issue 3 years ago • 1 comments

Description of problem: We are seeing a few directories missing on one brick of a secondary volume in a geo-replicated setup. We do not yet know the root cause, but DHT already has logic to heal missing directories and fix holes in the layout distribution when a lookup is performed. However, we saw logs like the ones below, which show the layout heal failing with EROFS (in geo-rep, secondary volumes are read-only by default).

[MSGID: 114031] [client-rpc-fops_v2.c:224:client4_0_mkdir_cbk] 0-ns1-client-0: remote operation failed. [{path=/small/file_srcdir/DRAVID-N6/thrd_03/d_009/d_004}, {errno=30}, {error=Read-only file system}]
[MSGID: 109005] [dht-selfheal.c:1064:dht_selfheal_dir_mkdir_cbk] 0-ns1-dht: Healing of path failed [{path=/small/file_srcdir/DRAVID-N6/thrd_03/d_009/d_004}, {gfid=d7953607-1a53-40e4-a43a-22bec90bd8b8}, {errno=30}, {error=Read-only file system}]
[MSGID: 114031] [client-rpc-fops_v2.c:2016:client4_0_setattr_cbk] 0-ns1-client-1: remote operation failed. [{errno=30}, {error=Read-only file system}]
[MSGID: 114031] [client-rpc-fops_v2.c:2016:client4_0_setattr_cbk] 0-ns1-client-0: remote operation failed. [{errno=116}, {error=Stale file handle}]
[MSGID: 109114] [dht-lock.c:1038:dht_blocking_inodelk_cbk] 0-ns1-dht: inodelk failed on subvol [{subvol=ns1-readdir-ahead-0}, {gfid=d7953607-1a53-40e4-a43a-22bec90bd8b8}, {errno=116}, {error=Stale file handle}]


[server-rpc-fops_v2.c:503:server4_mkdir_cbk] 0-ns1-server: MKDIR info [{frame=214}, {MKDIR_path=/small/file_srcdir/DRAVID-N6/thrd_03/d_009/d_004}, {uuid_utoa=5c95d499-7f22-428f-aa71-24e8a1732022}, {bname=d_004}, {client=CTX_ID:c32b9d32-31ff-4a4a-881e-46d8c728f9f2-GRAPH_ID:0-PID:21333-HOST:phfs-node1-PC_NAME:ns1-client-0-RECON_NO:-1}, {error-xlator=ns1-read-only}, {errno=30}, {error=Read-only file system}] 
[posix-entry-ops.c:262:posix_lookup] 0-ns1-posix: Found stale gfid handle 95/d7953607-1a53-40e4-a43a-22bec90bd8b8, removing it. [No such file or directory]
[server-rpc-fops_v2.c:1686:server4_setattr_cbk] 0-ns1-server: SETATTR info [{frame=215}, {path=}, {uuid_utoa=d7953607-1a53-40e4-a43a-22bec90bd8b8}, {client=CTX_ID:c32b9d32-31ff-4a4a-881e-46d8c728f9f2-GRAPH_ID:0-PID:21333-HOST:phfs-node1-PC_NAME:ns1-client-0-RECON_NO:-1}, {error-xlator=ns1-posix}, {errno=116}, {error=Stale file handle}]

We fixed it by restarting the brick and triggering a lookup from a fresh mount. From reading the code, however, it appears that a lookup from an existing mount that already has the inode (the revalidate path) may trigger the heal with incorrect permissions, causing it to fail with EROFS. The fresh-lookup code path does not have this issue.
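To make the suspected failure mode concrete, here is a minimal, self-contained sketch in C. It is a toy model and not GlusterFS code: every name in it (`toy_identity`, `toy_readonly_mkdir`, `toy_heal_mkdir`, `TOY_INTERNAL_PID`) is hypothetical. It assumes that a read-only layer lets requests through when they carry a negative pid, and that the heal should temporarily run as root with such a pid, as the issue title suggests.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/types.h>

/* Hypothetical marker for an internal (non-application) request. */
#define TOY_INTERNAL_PID ((pid_t)-1)

/* Toy stand-in for the identity carried by a request. Not a GlusterFS type. */
struct toy_identity {
    uid_t uid;
    gid_t gid;
    pid_t pid;
};

/* Assumption: internal requests are recognised by a negative pid. */
static bool toy_is_internal(const struct toy_identity *id)
{
    return id->pid < 0;
}

/* Toy read-only layer: rejects mkdir with -EROFS unless the request
 * is internal. Returns 0 when the mkdir is allowed to proceed. */
static int toy_readonly_mkdir(const struct toy_identity *id, bool read_only)
{
    if (read_only && !toy_is_internal(id))
        return -EROFS;
    return 0;
}

/* Toy directory heal. With elevate == true it switches the identity to
 * root plus a negative pid for the duration of the mkdir, then restores
 * the caller's identity. With elevate == false it models the suspected
 * bug: the heal is sent with the application's own identity. */
static int toy_heal_mkdir(struct toy_identity *id, bool read_only, bool elevate)
{
    struct toy_identity saved = *id;
    int ret;

    if (elevate) {
        id->uid = 0;
        id->gid = 0;
        id->pid = TOY_INTERNAL_PID;
    }

    ret = toy_readonly_mkdir(id, read_only);

    *id = saved; /* always restore the original identity */
    return ret;
}
```

In this model, calling `toy_heal_mkdir()` with `elevate` set to false on a read-only volume returns `-EROFS`, which mirrors the `errno=30` seen in the logs, while the elevated call succeeds and leaves the caller's uid, gid, and pid unchanged afterwards.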

itisravi, Feb 16 '22 15:02

Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.

stale[bot], Sep 21 '22 00:09

Closing this issue as there was no update since my last update on issue. If this is an issue which is still valid, feel free to open it.

stale[bot], Nov 01 '22 21:11