
Heal count has been increasing for days and currently at over 34 million

Open dalwise opened this issue 2 years ago • 61 comments

Description of problem: Heal count keeps increasing. Current status after 13 days of increases is:

# gluster volume heal vol_name statistics heal-count
Gathering count of entries to be healed on volume vol_name has been successful

Brick host_name21:/shared/.brick
Number of entries: 34410002

Brick host_name22:/shared/.brick
Number of entries: 34363886

Brick host_name20:/shared/.brick
Number of entries: 0

The exact command to reproduce the issue: The issue happened after trying to recover from an inaccessible directory (/shared/vol_name/logs) within the volume mounted at /shared/vol_name. Trying to access this directory would return "Transport endpoint not connected" on all clients. Other directories in the mounted volume were not affected.

gluster volume heal showed GFIDs in need of healing and the problematic directory in split brain. We were not able to get the files healed by using the gluster heal commands. We were then able to resolve problems with most GFIDs by removing the corresponding files in the bricks. However one GFID remained in need of healing on host_name20 and we could not determine what file it corresponded to. Since host_name20 just had the arbiter brick we tried removing it:

gluster volume remove-brick vol_name replica 2 host_name20:/shared/.brick force
gluster peer detach host_name20

That allowed us to access the directory that we could not see earlier. We then attempted to rejoin the arbiter with a clean brick:

rm -rf /shared/.brick/*  # Ran this on host_name20
gluster peer probe host_name20
gluster volume add-brick vol_name replica 3 arbiter 1 host_name20:/shared/.brick force

The directory is still accessible, but the number of files in need of healing has been increasing for the last 13 days.

We can likely recover by simply backing up the files, destroying the current volume and then moving the files onto a newly created volume. However we are at a loss as to:

  • How the initial problem with directory not being accessible happened
  • How to prevent it from happening again
  • How to deal with it more efficiently if the problem does happen again

The full output of the command that failed: (heal count command above)

Expected results: Heal count to return to 0.

Mandatory info: - The output of the gluster volume info command:

Volume Name: vol_name
Type: Replicate
Volume ID: 4833820f-4518-4a2e-a3d8-63d6f31c4646
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: host_name21:/shared/.brick
Brick2: host_name22:/shared/.brick
Brick3: host_name20:/shared/.brick (arbiter)
Options Reconfigured:
cluster.self-heal-window-size: 2
cluster.shd-wait-qlength: 2048
disperse.shd-wait-qlength: 2048
cluster.shd-max-threads: 8
cluster.self-heal-daemon: enable
transport.address-family: inet6
nfs.disable: on
performance.client-io-threads: off

- The output of the gluster volume status command:

Status of volume: vol_name
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick host_name21:/shared/.brick                  49153     0          Y       55517
Brick host_name22:/shared/.brick                  49153     0          Y       60313
Brick host_name20:/shared/.brick                  49152     0          Y       59555
Self-heal Daemon on localhost               N/A       N/A        Y       58347
Self-heal Daemon on host_name22                   N/A       N/A        Y       89514
Self-heal Daemon on host_name21                   N/A       N/A        Y       57650

Task Status of Volume vol_name
------------------------------------------------------------------------------
There are no active volume tasks

- The output of the gluster volume heal command: (Shared heal count above as heal contains tens of millions of entries)

- Provide logs present on following locations of client and server nodes - /var/log/glusterfs/

https://www.dropbox.com/s/p93wyyztj5bmzk8/logs.tgz

This compressed log file contains the logs for all three server nodes. The server nodes have the volume mounted and so are acting as clients also.

The volume name is the name of an internal project and has been changed to "vol_name" in command outputs and logs. The hostnames are also internal and have been changed to host_name20, host_name21 & host_name22.

- Is there any crash? Provide the backtrace and coredump: No crash

Additional info:

- The operating system / glusterfs version:

  • CentOS Linux release 7.6.1810 (Core)
  • Gluster 7.4-1


dalwise avatar Dec 24 '21 02:12 dalwise

I've done a quick check of the self-heal logs, and it seems there are no issues. Healing seems to be progressing. There have been around 8 million files healed since 12th of December.

How many files are there in the volume ? is it possible that there are more than 40 million files with many directories ?

Can you check if the number of files in host_name20 is increasing ? if so, it means that self-heal is doing its job, but it's running too slowly. If that's the case, probably you could use the recommendations from issue #3025. You should disable least priority:

# gluster volume set <volname> performance.enable-least-priority no

And you can also start additional self-heal processes as specified in that issue.
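
For reference, an extra self-heal process is normally started by copying the command line of the glustershd daemon that glusterd is already running on that node and giving it its own pid and log file. This is only a rough sketch; the volfile-id, the -extra1 suffix, the pid/log paths and the node UUID are placeholders that must be taken from the output of ps aux | grep glustershd on that node:

# <volname>, the -extra1 suffix and <node-uuid> below are placeholders, not fixed names
/usr/sbin/glusterfs -s localhost --volfile-id shd/<volname> \
    -p /var/run/gluster/shd/<volname>/<volname>-extra1.pid \
    -l /var/log/glusterfs/glustershd-extra1.log \
    --xlator-option '*replicate*.node-uuid=<node-uuid>' \
    --process-name glustershd --client-pid=-6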

IMPORTANT: Issue #3025 is for Gluster 9.4. It should also work in your version, but I recommend that you don't try this in production without having done tests in a lab environment.

xhernandez avatar Dec 24 '21 10:12 xhernandez

Thanks for your help, Xavi.

How many files are there in the volume ? is it possible that there are more than 40 million files with many directories ?

Yes, from the brick directory on each host:

[root@host_name20 .brick]# find . -path ./.glusterfs -prune -o -print | wc -l
46536502
[root@host_name20 .brick]# find . -path ./.glusterfs -prune -o -type d -print | wc -l
27520629
[root@host_name21 .brick]# find . -path ./.glusterfs -prune -o -print | wc -l
113235113
[root@host_name21 .brick]# find . -path ./.glusterfs -prune -o -type d -print | wc -l
30124557
[root@host_name22 .brick]# find . -path ./.glusterfs -prune -o -print | wc -l
113281980
[root@host_name22 .brick]# find . -path ./.glusterfs -prune -o -type d -print | wc -l
30169389

I am not sure what would explain the discrepancy between host_name21 and host_name22, since there were no files in need of healing on them when host_name20 got removed and nothing outside of gluster should currently be writing to the volume.

Can you check if the number of files in host_name20 is increasing ?

The number of files shown using the command above is slowly increasing, but after two weeks it is still less than half what the other two servers have.

One thing I have noticed and was not expecting is that the current brick size on host_name20 (the arbiter) is already larger than it was on the backup of the brick that we made before wiping it:

[root@host_name20 .brick]# du -sh /shared/dot_brick_backup/
4.0G	/shared/dot_brick_backup/
[root@host_name20 .brick]# du -sh /shared/.brick/
5.0G	/shared/.brick/

I'll try the suggestions from #3025. I need the data on the volume, but this stack of servers has been taken out of production until we recover it. So I'm thinking of first backing up the data and then trying out the improvements from #3025 on the affected stack itself. Worst case scenario I have the backed up data and can copy it back to a fresh volume. Would it be OK to back up the data directly from the host_name21 or host_name22 brick? I assume this would be faster than backing up from the mounted volume and there is nothing currently being written on the volume.

One thing that I am uncertain about when following the guidance in #3025 is what server I should be monitoring for IO stats: host_name20 is the one that re-joined the volume and is in need of repair, so I'm not quite sure why heal count shows no files needing healing on it and instead shows them on the remaining two servers.

dalwise avatar Dec 25 '21 07:12 dalwise

Thanks for your help, Xavi.

How many files are there in the volume ? is it possible that there are more than 40 million files with many directories ?

Yes, from the brick directory on each host:

[root@host_name20 .brick]# find . -path ./.glusterfs -prune -o -print | wc -l
46536502
[root@host_name20 .brick]# find . -path ./.glusterfs -prune -o -type d -print | wc -l
27520629
[root@host_name21 .brick]# find . -path ./.glusterfs -prune -o -print | wc -l
113235113
[root@host_name21 .brick]# find . -path ./.glusterfs -prune -o -type d -print | wc -l
30124557
[root@host_name22 .brick]# find . -path ./.glusterfs -prune -o -print | wc -l
113281980
[root@host_name22 .brick]# find . -path ./.glusterfs -prune -o -type d -print | wc -l
30169389

I am not sure what would explain the discrepancy between host_name21 and host_name22, since there were no files in need of healing on them when host_name20 got removed and nothing outside of gluster should currently be writing to the volume.

If there's new data being added to the volume, it could be explained by the different times the find command was run. Otherwise it's something we will need to check once self-heal completes.

Can you check if the number of files in host_name20 is increasing ?

The number of files shown using the command above is slowly increasing, but after two weeks it is still less than half what the other two servers have.

Gluster needs a significant amount of metadata and locking operations per entry. Given the number of files and directories you have, it's likely that most of the time is spent doing those operations. I think parallelizing the self-heal should help here.

One thing I have noticed and was not expecting is that the current brick size on host_name20 (the arbiter) is already larger than it was on the backup of the brick that we made before wiping it:

[root@host_name20 .brick]# du -sh /shared/dot_brick_backup/
4.0G	/shared/dot_brick_backup/
[root@host_name20 .brick]# du -sh /shared/.brick/
5.0G	/shared/.brick/

Are you using XFS as the brick filesystem ? we have seen that XFS does preallocation of disk blocks in some cases to optimize future writes. Since currently self-heal is creating a lot of new entries, this extra reservation of blocks could be causing the overhead. If that's the case, it should fix itself over time once the volume is healed.

I'll try the suggestions from #3025. I need the data on the volume, but this stack of servers has been taken out of production until we recover it. So I'm thinking of first backing up the data and then trying out the improvements from #3025 on the affected stack itself. Worst case scenario I have the backed up data and can copy it back to a fresh volume. Would it be OK to back up the data directly from the host_name21 or host_name22 brick? I assume this would be faster than backing up from the mounted volume and there is nothing currently being written on the volume.

If you plan to restore the data only to a new volume and you are sure that the data in those bricks is healthy, then yes, it should be enough to copy one brick. However, given the discrepancy seen in the outputs of find command, first I would try to understand why that happened to be sure we are not missing something by only copying one brick.

One thing that I am uncertain about when following the guidance in #3025 is what server I should be monitoring for IO stats: host_name20 is the one that re-joined the volume and is in need of repair, so I'm not quite sure why heal count shows no files needing healing on it and instead shows them on the remaining two servers.

The number that heal count shows is not the number of entries that need healing in each brick. It's the number of entries updated in each brick that failed to update in another brick (so needing heal in another brick). When a brick goes down, the heal count of the surviving bricks will start to increase when changes are done to entries that are also present in the missing brick. When the bad brick comes back again, the heal count of that brick will be 0, but the other bricks will have the list of entries that need to be healed in the bad brick. So most of the write activity will happen in the bad brick, even if it has heal count = 0.

xhernandez avatar Dec 27 '21 08:12 xhernandez

If there's new data being added to the volume, it could be explained by the different times the find command was run. Otherwise it's something we will need to check once self-heal completes.

The hosts were taken out of production soon after the initial problem, on December 10. There should be no data being written to the volume.

Are you using XFS as the brick filesystem ? we have seen that XFS does preallocation of disk blocks in some cases to optimize future writes. Since currently self-heal is creating a lot of new entries, this extra reservation of blocks could be causing the overhead. If that's the case, it should fix itself over time once the volume is healed.

Yes, the brick's filesystem is indeed XFS.

If you plan to restore the data only to a new volume and you are sure that the data in those bricks is healthy, then yes, it should be enough to copy one brick. However, given the discrepancy seen in the outputs of find command, first I would try to understand why that happened to be sure we are not missing something by only copying one brick.

I have been doing an rsync from the mounted volume, but it is progressing at only ~50GB/day. Since the volume has 1.3T the backup would take almost a month. Perhaps I can try backing up both bricks rather than just one before trying the repairs from #3025?

The number that heal count shows is not the number of entries that need healing in each brick. It's the number of entries updated in each brick that failed to update in another brick (so needing heal in another brick). When a brick goes down, the heal count of the surviving bricks will start to increase when changes are done to entries that are also present in the missing brick. When the bad brick comes back again, the heal count of that brick will be 0, but the other bricks will have the list of entries that need to be healed in the bad brick. So most of the write activity will happen in the bad brick, even if it has heal count = 0.

Thank you for your very clear explanation. If most of the activity will happen on the brick of host_name20 then I think it may be challenging to evaluate the effect of changes by looking at I/O stats as in #3025: since host_name20 is an arbiter the amount of data written to it as far as I know is minimal. Do you think we may be better off monitoring the effect of changes with other metrics? Maybe the ones above like the number of entries in heal-count or the results of the find on the brick?

dalwise avatar Dec 29 '21 21:12 dalwise

The hosts were taken out of production soon after the initial problem, on December 10. There should be no data being written to the volume.

In that case the difference in number of files and directories seems suspicious. Maybe you could compare the output of two finds to see where the differences are.
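
One possible way to do that comparison, just as a sketch (the list file names are arbitrary): run the same find on each brick, sort the output, and compare the two lists, for example:

# On host_name21:
cd /shared/.brick && find . -path ./.glusterfs -prune -o -print | sort > /tmp/brick21.list
# On host_name22:
cd /shared/.brick && find . -path ./.glusterfs -prune -o -print | sort > /tmp/brick22.list
# After copying both lists to one node, show entries that exist on only one brick:
comm -3 /tmp/brick21.list /tmp/brick22.list | head -n 50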

I have been doing an rsync from the mounted volume, but it is progressing at only ~50GB/day. Since the volume has 1.3T the backup would take almost a month. Perhaps I can try backing up both bricks rather than just one before trying the repairs from #3025?

I think that will be the safest way until we understand where the difference comes from.

Thank you for your very clear explanation. If most of the activity will happen on the brick of host_name20 then I think it may be challenging to evaluate the effect of changes by looking at I/O stats as in #3025: since host_name20 is an arbiter the amount of data written to it as far as I know is minimal. Do you think we may be better off monitoring the effect of changes with other metrics? Maybe the ones above like the number of entries in heal-count or the results of the find on the brick?

Probably the easiest way to monitor the progress of self-heal and be able to compare different configurations is to check the number of used inodes of the brick periodically (df -i <brick path>). It can't be used to know exactly how many files have been created because Gluster uses 2 inodes for directories and 1 inode for other entries, but it's enough to do a comparison, for example using the average number of inodes created per second.
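
A minimal sketch of that kind of monitoring (the log file path and the one-minute interval are arbitrary choices):

# Run on the arbiter node; appends a timestamped used-inode count every minute
while true; do
    echo "$(date -u '+%F %T') $(df -iP /shared/.brick | awk 'NR==2 {print $3}')" >> /root/arbiter-inodes.log
    sleep 60
done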

xhernandez avatar Dec 29 '21 22:12 xhernandez

Thanks Xavi! I'll do the following then:

  1. Look for differences in the find of bricks in host_name21 and host_name22
  2. Back up the bricks on both hosts
  3. Set up monitoring of df -i <brick path> so that we can have a baseline before making changes.

dalwise avatar Dec 29 '21 22:12 dalwise

Hi Xavi, I'm just providing a quick update. The backups are still ongoing, even when backing up directly from the bricks. I ran out of inodes for the backup, so I'm removing .glusterfs directory from the backup.

dalwise avatar Jan 06 '22 06:01 dalwise

Hi @dalwise. How is it going ?

Is the self-heal still progressing on the arbiter brick or have you stopped it ? if it's still running, probably it should have already identified all pending files to be healed and the number of pending files should be near 80 or 90 million and slowly decreasing. Is that the case ?

Is self-heal log still clean of errors ?

xhernandez avatar Jan 17 '22 18:01 xhernandez

Hi @xhernandez, thanks for checking in!

Self-heal is ongoing. It is currently at over 64M and increasing by about ~50k per hour.

I had some problems backing up the bricks. I first tried backing up to the same partition as the brick but I ran out of inodes. I'm not sure why this happened as previous to the backup only 44% of the inodes were being used, so even doubling it should not have been an issue. My only theory is that rsync creates temporary inodes and those pushed the inode count.

I have since completed a local backup excluding the .glusterfs directory, which worked ok. I am now in the process of transferring that directory to a different host for offline backup. That has currently backed up ~92M inodes out of ~113M (but I'm using rsync, so based on the previous paragraph I'm not sure whether it will go somewhat over 113M before completing).

As for errors I don't see any on /var/log/glusterfs/glfsheal-vol_name.log in the arbiter node and in fact the last message is old:

[2021-12-29 18:09:48.459408] I [MSGID: 104041] [glfs-resolve.c:954:__glfs_active_subvol] 0-vol_name: switched to graph 716d6e32-302d-3133-3539-39342d323032 (0)
"/var/log/glusterfs/glfsheal-vol_name.log" 952L, 45317C

The remaining two nodes have that file empty.

dalwise avatar Jan 18 '22 19:01 dalwise

Self-heal is ongoing. It is currently at over 64M and increasing by about ~50k per hour.

I thought it would have discovered all directories by now. I was too optimistic...

I had some problems backing up the bricks. I first tried backing up to the same partition as the brick but I ran out of inodes. I'm not sure why this happened as previous to the backup only 44% of the inodes were being used, so even doubling it should not have been an issue. My only theory is that rsync creates temporary inodes and those pushed the inode count.

Many of the files inside .glusterfs are hardlinks to entries outside it, so unless you use the -H option in rsync, the number of inodes created will be much higher.
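
For example, something along these lines should keep the destination inode count close to the source (the destination path is just a placeholder):

# -a preserves ownership/permissions/times; -H recreates hardlinks instead of copying each link separately
rsync -aH /shared/.brick/ /backup/dot_brick_backup/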

I have since completed a local backup excluding the .glusterfs directory, which worked ok. I am now in the process of transferring that directory to a different host for offline backup. That has currently backed up ~92M inodes out of ~113M (but I'm using rsync, so based on the previous paragraph I'm not sure whether it will go somewhat over 113M before completing).

Just for reference, directories use two inodes, one for the real directory and one for a special symlink created inside .glusterfs. All other entries use a single inode with two hardlinks.
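
For example, using the find counts you posted for host_name21 (roughly 30.1M directories and 83.1M other entries), a fully healed brick would be expected to use on the order of 2 × 30.1M + 83.1M ≈ 143M inodes, plus a small fixed overhead for the .glusterfs directory tree itself.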

As for errors I don't see any on /var/log/glusterfs/glfsheal-vol_name.log in the arbiter node and in fact the last message is old:

[2021-12-29 18:09:48.459408] I [MSGID: 104041] [glfs-resolve.c:954:__glfs_active_subvol] 0-vol_name: switched to graph 716d6e32-302d-3133-3539-39342d323032 (0)
"/var/log/glusterfs/glfsheal-vol_name.log" 952L, 45317C

The remaining two nodes have that file empty.

You need to look at glustershd.log. That's the log used by self-heal during reconstruction of the arbiter.

xhernandez avatar Jan 18 '22 21:01 xhernandez

Many of the files inside .glusterfs are hardlinks to entries outside it, so unless you use the -H option in rsync, the number of inodes created will be much higher.

Just for reference, directories use two inodes, one for the real directory and one for a special symlink created inside .glusterfs. All other entries use a single inode with two hardlinks.

Thanks for the tips!

You need to look at glustershd.log. That's the log used by self-heal during reconstruction of the arbiter.

The file on the arbiter is not showing any errors in the last few days. It seems to be healing fine. Here are the last few messages:

[2022-01-06 06:52:35.799445] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 5-vol_name-replicate-0: performing metadata selfheal on e64041e0-7d9b-4ef0-b094-072a52d55588
[2022-01-06 06:52:35.801009] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 5-vol_name-replicate-0: performing metadata selfheal on 0f9bd30f-2fe8-4ddd-90c0-c7eee2882f9f
[2022-01-06 06:52:35.840371] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 5-vol_name-replicate-0: performing metadata selfheal on 749d8c7c-5348-40a3-83ab-f8fd3d357462
[2022-01-06 06:52:35.841280] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 5-vol_name-replicate-0: Completed metadata selfheal on e13f9126-3cf6-45c9-8dc8-34791ff12917. sources=[2]  sinks=0 1
[2022-01-06 06:52:35.877509] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 5-vol_name-replicate-0: Completed metadata selfheal on dc89a327-a633-4be1-b8b1-363e7a9bfaf4. sources=[2]  sinks=0 1
[2022-01-06 06:52:35.883342] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 5-vol_name-replicate-0: Completed metadata selfheal on 6aa9108f-b5c0-44d7-a438-53b274522157. sources=[2]  sinks=0 1
[2022-01-06 06:52:35.884929] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 5-vol_name-replicate-0: Completed metadata selfheal on e64041e0-7d9b-4ef0-b094-072a52d55588. sources=[2]  sinks=0 1
[2022-01-06 06:52:35.885091] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 5-vol_name-replicate-0: Completed metadata selfheal on 0f9bd30f-2fe8-4ddd-90c0-c7eee2882f9f. sources=[2]  sinks=0 1
[2022-01-06 06:52:35.892500] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 5-vol_name-replicate-0: Completed metadata selfheal on 749d8c7c-5348-40a3-83ab-f8fd3d357462. sources=[2]  sinks=0 1
[2022-01-06 06:54:10.963909] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 3-vol_name-replicate-0: Completed entry selfheal on 5ed1e5c6-e8a6-4903-9525-da8ac0dc0467. sources= sinks=0 1 2
[2022-01-09 03:43:13.033691] I [MSGID: 100011] [glusterfsd.c:1679:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2022-01-09 03:43:13.033783] I [MSGID: 100011] [glusterfsd.c:1679:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2022-01-09 03:43:13.035032] I [MSGID: 100040] [glusterfsd-mgmt.c:106:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing
[2022-01-09 03:43:13.035141] I [MSGID: 100040] [glusterfsd-mgmt.c:106:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing
[2022-01-12 03:10:13.376457] I [MSGID: 100011] [glusterfsd.c:1679:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2022-01-12 03:10:13.376474] I [MSGID: 100011] [glusterfsd.c:1679:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2022-01-12 03:10:13.377881] I [MSGID: 100040] [glusterfsd-mgmt.c:106:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing
[2022-01-12 03:10:13.378020] I [MSGID: 100040] [glusterfsd-mgmt.c:106:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing
[2022-01-16 03:21:14.222491] I [MSGID: 100011] [glusterfsd.c:1679:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2022-01-16 03:21:14.222612] I [MSGID: 100011] [glusterfsd.c:1679:reincarnate] 0-glusterfsd: Fetching the volume file from server...
[2022-01-16 03:21:14.224039] I [MSGID: 100040] [glusterfsd-mgmt.c:106:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing
[2022-01-16 03:21:14.224112] I [MSGID: 100040] [glusterfsd-mgmt.c:106:mgmt_process_volfile] 0-glusterfs: No change in volfile, continuing

The same file on the data nodes is much more chatty, but seems to be fine also. It contains entries like:

[2022-01-18 22:47:43.073863] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 5-vol_name-replicate-0: performing metadata selfheal on 812e4ac8-7ecd-4a29-8490-cfe3475839c1
[2022-01-18 22:47:43.154375] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 5-vol_name-replicate-0: Completed metadata selfheal on 812e4ac8-7ecd-4a29-8490-cfe3475839c1. sources=0 [1]  sinks=2

Both data nodes seem to have no errors:

# egrep -c '\]\sE\s\[' /var/log/glusterfs/glustershd.log
0

Whereas the arbiter just has some errors from January 6 when one of the data nodes was left without available inodes.

dalwise avatar Jan 18 '22 22:01 dalwise

Hi @xhernandez,

The backup completed and I have been applying the solutions that you suggested from #3025, while monitoring the increase in inodes (via df -i). Unfortunately I haven't seen much noticeable improvement:

[image: chart of inodes created per minute over time]

The black vertical line is when I ran gluster volume set vol_name performance.enable-least-priority no. It made a small improvement (from ~900 to ~920 inodes/minute), but it disappeared over time.

The green vertical lines are for when I started each of 5 additional glusterfs processes.

The purple vertical line is when I ran gluster volume set vol_name cluster.data-self-heal-algorithm full

The dips every hour are from the other monitoring which runs gluster vol heal vol_name statistics heal-count.

ps aux | grep glustershd shows the following. I have followed the pattern from the original glustershd I had running, even though it is slightly different from the one in #3025, which I assume is due to the different Gluster version (mine has asterisks in the --xlator-option argument):

root       9478  0.0  0.0 675796  5848 ?        Ssl  Jan28   0:03 /usr/sbin/glusterfs -s localhost --volfile-id shd/vol_name -p /var/run/gluster/shd/vol_name/vol_name-extra2.pid -l /var/log/glusterfs/glustershd-extra2.log --xlator-option *replicate*.node-uuid=fb52f06d-4571-4486-94aa-77eb401d388e --process-name glustershd --client-pid=-6
root      59414  1.0  0.0 1034436 35664 ?       Ssl   2021 756:10 /usr/sbin/glusterfs -s localhost --volfile-id shd/vol_name -p /var/run/gluster/shd/vol_name/vol_name-shd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/f438aa9983d7f1fd.socket --xlator-option *replicate*.node-uuid=fb52f06d-4571-4486-94aa-77eb401d388e --process-name glustershd --client-pid=-6
root      65548  0.0  0.0 675796  7800 ?        Ssl  Jan28   0:03 /usr/sbin/glusterfs -s localhost --volfile-id shd/vol_name -p /var/run/gluster/shd/vol_name/vol_name-extra1.pid -l /var/log/glusterfs/glustershd-extra1.log --xlator-option *replicate*.node-uuid=fb52f06d-4571-4486-94aa-77eb401d388e --process-name glustershd --client-pid=-6
root      71572  0.0  0.0 675796  7788 ?        Ssl  Jan28   0:02 /usr/sbin/glusterfs -s localhost --volfile-id shd/vol_name -p /var/run/gluster/shd/vol_name/vol_name-extra5.pid -l /var/log/glusterfs/glustershd-extra5.log --xlator-option *replicate*.node-uuid=fb52f06d-4571-4486-94aa-77eb401d388e --process-name glustershd --client-pid=-6
root     128316  0.0  0.0 675796  5828 ?        Ssl  Jan28   0:02 /usr/sbin/glusterfs -s localhost --volfile-id shd/vol_name -p /var/run/gluster/shd/vol_name/vol_name-extra4.pid -l /var/log/glusterfs/glustershd-extra4.log --xlator-option *replicate*.node-uuid=fb52f06d-4571-4486-94aa-77eb401d388e --process-name glustershd --client-pid=-6
root     184165  0.0  0.0 675796  7872 ?        Ssl  Jan28   0:02 /usr/sbin/glusterfs -s localhost --volfile-id shd/vol_name -p /var/run/gluster/shd/vol_name/vol_name-extra3.pid -l /var/log/glusterfs/glustershd-extra3.log --xlator-option *replicate*.node-uuid=fb52f06d-4571-4486-94aa-77eb401d388e --process-name glustershd --client-pid=-6
root     216441  0.0  0.0 112712   976 pts/0    S+   04:28   0:00 grep --color=auto glustershd

dalwise avatar Jan 29 '22 07:01 dalwise

@dalwise when you do configuration changes, you should also restart the self-heal processes started manually by you because they are not under the control of glusterd to apply the changes.

In any case, if that doesn't help, I would also try to modify these settings (restarting shd processes after that; see the commands sketched below):

  • Increase cluster.self-heal-window-size to 4 or 8
  • Increase server.event-threads and client.event-threads to 4 or 6.
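
For reference, the corresponding commands would be along these lines (the values are just the suggested ones and can be adjusted):

gluster volume set <volname> cluster.self-heal-window-size 8
gluster volume set <volname> server.event-threads 4
gluster volume set <volname> client.event-threads 4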

Can you post the current configuration settings ?

Are logs from extra self-heal processes showing activity ?

xhernandez avatar Jan 31 '22 08:01 xhernandez

Thank you @xhernandez for your help and quick response!

@dalwise when you do configuration changes, you should also restart the self-heal processes started manually by you because they are not under the control of glusterd to apply the changes.

Thanks for the tip

In any case, if that doesn't help, I would also try to modify these settings (restarting shd processes after that):

Increase cluster.self-heal-window-size to 4 or 8 Increase server.event-threads and client.event-threads to 4 or 6.

I did all of these. Not much difference though:

[image: chart of inodes created per minute over time]

Can you post the current configuration settings ?

Sure. I assume you mean the output of gluster volume get vol_name all. Here it is: gluster-settings-220201.txt

Are logs from extra self-heal processes showing activity ?

As far as I can tell they are not. Here are the logs from /var/log/glusterfs/glustershd-extra1.log:

glustershd-extra1.log

The process that is generating it is:

root     152776  0.0  0.0 757724  5724 ?        Ssl  03:18   0:00 /usr/sbin/glusterfs -s localhost --volfile-id shd/vol_name -p /var/run/gluster/shd/vol_name/vol_name-extra1.pid -l /var/log/glusterfs/glustershd-extra1.log --xlator-option *replicate*.node-uuid=fb52f06d-4571-4486-94aa-77eb401d388e --process-name glustershd --client-pid=-6

Just to double check that we are on the same page: I have been running the commands you suggested and starting the extra glusterfs processes on the arbiter node, which is the one in need of healing. I haven't made any changes on the existing data nodes. My understanding is that the settings propagate to all nodes since they apply to the volume, but I want to be sure that we are on the same page regarding the extra glusterfs processes.

dalwise avatar Feb 01 '22 06:02 dalwise

Just to double check that we are on the same page: I have been running the commands you suggested and starting the extra glusterfs processes on the arbiter node, which is the one in need of healing. I haven't made any changes on the existing data nodes. My understanding is that the settings propagate to all nodes since they apply to the volume, but I want to be sure that we are on the same page regarding the extra glusterfs processes.

I forgot to comment on this before... sorry.

Self-heal processes are run in each cluster node, and each one takes care of the list of pending files present in its own node only. This means that you should start extra self-heal processes in the other nodes (the ones that have a long list of pending files in gluster volume heal info). Since the arbiter brick was empty initially, it doesn't have damaged files, it's the other nodes that have damaged files pending to be healed into the arbiter brick.

I hope that by starting extra processes on the other nodes you will see some increase in throughput. At least you should see activity in the logs.

xhernandez avatar Feb 02 '22 13:02 xhernandez

Hi @xhernandez, I'm definitely glad that we clarified that and I'm sorry for not bringing it up earlier. Adding extra processes in the other two nodes caused a huge increase in rate of new inodes created on the arbiter for about two hours. After that it has come down again, but is still considerably more than it was before the new processes (~2200/minute vs ~900/minute):

[image: chart of inodes created per minute over time]

Each green line is when a new process was started on both of the data nodes.

As you can see just adding one additional process improved the stats by over an order of magnitude. The second one also seemed to help, but it's unclear that additional ones made any difference.

I'm not sure why a second process would have such a great effect (much more than double). Is it possible that the settings that I had been applying were not being applied to the process that was already running on each node?

The other metric that I was tracking (files needing healing) peaked while starting the additional processes and is now decreasing and changing much faster than previously. If it continues at this rate it would come down to zero within a couple of days:

Wed Feb  2 15:45:01 UTC 2022	Number of entries: 77363403
Wed Feb  2 16:45:01 UTC 2022	Number of entries: 77395054
Wed Feb  2 17:45:01 UTC 2022	Number of entries: 77426850
Wed Feb  2 18:45:01 UTC 2022	Number of entries: 77458946
Wed Feb  2 19:45:01 UTC 2022	Number of entries: 77490137
Wed Feb  2 20:45:01 UTC 2022	Number of entries: 76807711
Wed Feb  2 21:45:01 UTC 2022	Number of entries: 74170119
Wed Feb  2 22:45:01 UTC 2022	Number of entries: 71708505
Wed Feb  2 23:45:01 UTC 2022	Number of entries: 69943704
Thu Feb  3 00:45:01 UTC 2022	Number of entries: 68211289

Thank you very much for your help in speeding up the healing process. I'll continue to monitor the progress. Would you have any guidance as to why the original problem happened and whether there was a better remediation available that may have saved us from the millions of entries in need of healing?

dalwise avatar Feb 03 '22 01:02 dalwise

Hi @xhernandez, I'm definitely glad that we clarified that and I'm sorry for not bringing it up earlier. Adding extra processes in the other two nodes caused a huge increase in rate of new inodes created on the arbiter for about two hours. After that it has come down again, but is still considerably more than it was before the new processes (~2200/minute vs ~900/minute):

That's great :)

Each green line is when a new process was started on both of the data nodes.

As you can see just adding one additional process improved the stats by over an order of magnitude. The second one also seemed to help, but it's unclear that additional ones made any difference.

I'm not sure why a second process would have such a great effect (much more than double). Is it possible that the settings that I had been applying were not being applied to the process that was already running on each node?

Self-heal daemons controlled by glusterd should have applied the changes immediately, so all settings should have been already in effect. It might be related to the way self-heal works. When a directory is healed, all entries inside it are created, but not healed yet. This means that after healing a directory, many files may have been created and added to the pending heal queue, increasing the pending heal count. This can cause bursts of file creations that could explain why you saw that increment (though I can't tell for sure why the improvement was so huge at the beginning). Once the number of directories pending heal is reduced, file creation rate drops and healers process more files than directories.

The other metric that I was tracking (files needing healing) peaked while starting the additional processes and is now decreasing and changing much faster than previously. If it continues at this rate it would come down to zero within a couple of days:

Wed Feb  2 15:45:01 UTC 2022	Number of entries: 77363403
Wed Feb  2 16:45:01 UTC 2022	Number of entries: 77395054
Wed Feb  2 17:45:01 UTC 2022	Number of entries: 77426850
Wed Feb  2 18:45:01 UTC 2022	Number of entries: 77458946
Wed Feb  2 19:45:01 UTC 2022	Number of entries: 77490137
Wed Feb  2 20:45:01 UTC 2022	Number of entries: 76807711
Wed Feb  2 21:45:01 UTC 2022	Number of entries: 74170119
Wed Feb  2 22:45:01 UTC 2022	Number of entries: 71708505
Wed Feb  2 23:45:01 UTC 2022	Number of entries: 69943704
Thu Feb  3 00:45:01 UTC 2022	Number of entries: 68211289

The healing rate seems quite good now. Even though there are not so many file creations, most of the heals are just to set the metadata to the files created during directory heal. Since this is an arbiter brick, no data transfer is needed, so each heal is quite fast and files get removed from the pending list at a good pace.

Thank you very much for your help in speeding up the healing process. I'll continue to monitor the progress. Would you have any guidance as to why the original problem happened and whether there was a better remediation available that may have saved us from the millions of entries in need of healing?

When there's an issue with a specific directory, normally it's because some metadata or extended attribute doesn't match and self-heal is unable to find enough consistent bricks to decide how to fix it in an automatic way (it requires more intelligence than what is implemented inside it). However, after manual inspection of the metadata from each brick, it's normally easy to identify what's wrong and fix it (especially for directories; for files it may be harder if the contents don't match).

In your particular case, I can't tell what happened because I don't have the data before the brick was removed. However, in your first comment, you said that the directory was in split brain. This means that there were inconsistent changes between bricks, without a clear "clean" version of that directory from where to pull the data from the other bricks, so self-heal wasn't able to progress and the directory wasn't accessible.

When you deal with such big volumes, with tens of millions of directories and files, volume-wide operations like a full heal or a rebalance are going to be slow. So if they can be avoided, they should be avoided. In this case, trying to fix whatever was wrong in that directory would most probably have been much faster. It should always be the first attempt.

Also, you are using a pretty old version. There have been several heal and consistency improvements over time, besides bug fixes, that make some problems harder to happen and allow more cases to be healed. I would recommend keeping the version more up to date to avoid some of the known problems, if possible.

xhernandez avatar Feb 04 '22 16:02 xhernandez

Thanks for the feedback @xhernandez.

As we were expecting, the count of files in need of healing decreased quickly over the last couple of days. But it seems to have now stopped at around 2475. It has remained there for the last 7 hours:

# gluster vol heal vol_name statistics heal-count
Gathering count of entries to be healed on volume vol_name has been successful

Brick host_name21:/shared/.brick
Number of entries: 2475

Brick host_name22:/shared/.brick
Number of entries: 1

Brick host_name20:/shared/.brick
Number of entries: 1

I have now killed the extra glusterfs processes on all three nodes, since we no longer seem to have a performance issue and they could complicate checking for errors in the logs.

I tried running gluster volume heal vol_name full, but that didn't seem to have any effect.

I am looking for errors in the logs using egrep '\]\sE\s\[' <file>. I've only found recent errors in /var/log/glusterfs/<mount_point>.log, which all look like:

[2022-02-07 19:10:29.206069] E [fuse-bridge.c:227:check_and_dump_fuse_W] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13a)[0x7f3514ff993a] (--> /usr/lib64/glusterfs/7.4/xlator/mount/fuse.so(+0x8221)[0x7f350c3a9221] (--> /usr/lib64/glusterfs/7.4/xlator/mount/fuse.so(+0x8b3a)[0x7f350c3a9b3a] (--> /lib64/libpthread.so.0(+0x7dd5)[0x7f3513e3add5] (--> /lib64/libc.so.6(clone+0x6d)[0x7f3513701ead] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory

But I'm not sure what file it is trying to write to.

I have seen online that doing a stat on every file in the mounted filesystem may trigger a heal, although given the size of the volume that may take a while. Do you have any suggestion on how to speed up recovery of those last few thousand files?
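
What I had seen suggested is something along the lines of the following, run from a client mount (only a sketch, and clearly slow on a volume this size):

find /shared/vol_name -exec stat {} + > /dev/null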

dalwise avatar Feb 07 '22 22:02 dalwise

By the way, one thing that I have noticed is that even though the number of entries needing healing on host_name21 has remained fairly stable at around 2475, the entries on host_name20 & host_name22 have varied quite a lot, despite the fact that nothing should currently be writing to the mounted volume. For example just a few minutes ago:

# gluster vol heal vol_name statistics heal-count
Gathering count of entries to be healed on volume vol_name has been successful

Brick host_name21:/shared/.brick
Number of entries: 2474

Brick host_name22:/shared/.brick
Number of entries: 395

Brick host_name20:/shared/.brick
Number of entries: 394

The glustershd log shows heals on all nodes, but the number of entries on host_name21 remains stable. For example:

[root@host_name20 glusterfs]# date
Tue Feb  8 00:16:27 UTC 2022
[root@host_name20 glusterfs]# tail -n3 glustershd.log
[2022-02-08 00:16:28.413337] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 3-vol_name-replicate-0: Completed metadata selfheal on 975cbb07-4584-444e-9ead-81d3405b209f. sources=[2]  sinks=0 1
[2022-02-08 00:16:28.415346] I [MSGID: 108026] [afr-self-heal-entry.c:916:afr_selfheal_entry_do] 3-vol_name-replicate-0: performing entry selfheal on 975cbb07-4584-444e-9ead-81d3405b209f
[2022-02-08 00:16:28.435790] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 3-vol_name-replicate-0: Completed entry selfheal on 975cbb07-4584-444e-9ead-81d3405b209f. sources=[2]  sinks=0 1

dalwise avatar Feb 08 '22 00:02 dalwise

I tried running gluster volume heal vol_name full, but that didn't seem to have any effect.

This command just starts a full scan of the volume checking every file, which means that it may take some time to reach the problematic files. However, if the files you want to heal already appear in the heal info, a full scan is not really necessary. Self-heal will already try to heal them. If it can't, it means that something is wrong with those files and they need to be checked manually.

I am looking for errors in the logs using egrep '\]\sE\s\[' <file>. I've only found recent errors in /var/log/glusterfs/<mount_point>.log, which all look like:

[2022-02-07 19:10:29.206069] E [fuse-bridge.c:227:check_and_dump_fuse_W] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13a)[0x7f3514ff993a] (--> /usr/lib64/glusterfs/7.4/xlator/mount/fuse.so(+0x8221)[0x7f350c3a9221] (--> /usr/lib64/glusterfs/7.4/xlator/mount/fuse.so(+0x8b3a)[0x7f350c3a9b3a] (--> /lib64/libpthread.so.0(+0x7dd5)[0x7f3513e3add5] (--> /lib64/libc.so.6(clone+0x6d)[0x7f3513701ead] ))))) 0-glusterfs-fuse: writing to fuse device failed: No such file or directory

That could happen sometimes when Gluster is trying to invalidate some file that the kernel has already removed from its cache. If there isn't anything else in the logs, I wouldn't care much about this message.

I have seen online that doing a stat on every file in the mounted filesystem may trigger a heal, although given the size of the volume that may take a while. Do you have any suggestion on how to speed up recovery of those last few thousand files?

You should run gluster volume heal <volname> info. It will show you the list of problematic files. Then take one of them and run these commands on it from each brick, not the mount point:

# stat <file>
# getfattr -m. -e hex -d <file>

This will provide the necessary information to fix that file. Most probably other files will have a similar problem, so we may be able to apply the same solution to them, but first let's see what happens on one of them.
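
For the entries that only show up as <gfid:...>, the backend file can usually be located directly on a brick, since it is stored under .glusterfs/<first 2 hex chars>/<next 2 hex chars>/<full gfid>. A rough sketch (the gfid value is only an example; for regular files the backend entry is a hardlink, so find -samefile reveals the user-visible path, while for directories it is a symlink whose target already shows the parent gfid and name):

GFID=9d83ace4-e904-4868-be69-0032e05acc5f   # example value only
BRICK=/shared/.brick
BACKEND="$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"
ls -l "$BACKEND"
# For regular files, list the real path(s) hardlinked to the same inode:
find "$BRICK" -path "$BRICK/.glusterfs" -prune -o -samefile "$BACKEND" -print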

By the way, one thing that I have noticed is that even though the number of entries needing healing on host_name21 has remained fairly stable at around 2475, the entries on host_name20 & host_name22 have varied quite a lot, despite the fact that nothing should currently be writing to the mounted volume. For example just a few minutes ago:

This is probably caused by the self-heal full command, that has found some additional issues. These could be related to the discrepancies you detected between the two remaining nodes at the beginning, or it could also be related to the problems caused when one of the bricks exhausted all its inodes.

xhernandez avatar Feb 08 '22 16:02 xhernandez

Most of the entries shown in gluster vol heal vol_name info are gfid entries like <gfid:9d83ace4-e904-4868-be69-0032e05acc5f>, however a few are filenames, so I'll start with those. All of the non-gfid entries are in the logs.bak2 directory. Here's one that shows up listed for both host_name20 and host_name22:

host_name20# stat logs.bak2/TW8YX11205001/J184091T50529848
  File: ‘logs.bak2/TW8YX11205001/J184091T50529848’
  Size: 27        	Blocks: 0          IO Block: 4096   directory
Device: fd00h/64768d	Inode: 4093747016  Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-08 19:12:09.229701706 +0000
Modify: 2021-11-11 19:30:30.313584109 +0000
Change: 2022-02-08 19:12:09.228701706 +0000
 Birth: -
host_name20# getfattr -m. -e hex -d logs.bak2/TW8YX11205001/J184091T50529848
# file: logs.bak2/TW8YX11205001/J184091T50529848
trusted.afr.vol_name-client-0=0x000000000000000000000000
trusted.afr.vol_name-client-1=0x000000000000000000000000
trusted.gfid=0x321547d38a7642c4a5fc266aed9bec50
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000618d6f560000000012b0e9ed00000000618d6f560000000012b0e9ed00000000609d432b000000003a349cf4
host_name21# stat logs.bak2/TW8YX11205001/J184091T50529848
  File: ‘logs.bak2/TW8YX11205001/J184091T50529848’
  Size: 27        	Blocks: 0          IO Block: 4096   directory
Device: fd00h/64768d	Inode: 38842218    Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-05-13 15:18:03.976526580 +0000
Modify: 2021-11-11 19:30:30.313584109 +0000
Change: 2022-02-08 19:12:09.228041551 +0000
 Birth: -
host_name21# getfattr -m. -e hex -d logs.bak2/TW8YX11205001/J184091T50529848
# file: logs.bak2/TW8YX11205001/J184091T50529848
trusted.afr.vol_name-client-1=0x000000000000000000000000
trusted.gfid=0x321547d38a7642c4a5fc266aed9bec50
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000618d6f560000000012b0e9ed00000000618d6f560000000012b0e9ed00000000609d432b000000003a349cf4
host_name22# stat logs.bak2/TW8YX11205001/J184091T50529848
  File: ‘logs.bak2/TW8YX11205001/J184091T50529848’
  Size: 27        	Blocks: 0          IO Block: 4096   directory
Device: fd00h/64768d	Inode: 4129139457  Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-05-13 15:18:03.976526580 +0000
Modify: 2021-11-11 19:30:30.313584109 +0000
Change: 2022-02-08 19:12:09.227005682 +0000
 Birth: -
host_name22# getfattr -m. -e hex -d logs.bak2/TW8YX11205001/J184091T50529848
# file: logs.bak2/TW8YX11205001/J184091T50529848
trusted.gfid=0x321547d38a7642c4a5fc266aed9bec50
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000618d6f560000000012b0e9ed00000000618d6f560000000012b0e9ed00000000609d432b000000003a349cf4

I am fine deleting the whole logs.bak2 directory to get rid of these entries pending heal. Would deleting the directory on the mounted volume be effective in solving the issue for these files?

dalwise avatar Feb 08 '22 19:02 dalwise

Most of the entries shown in gluster vol heal vol_name info are gfid entries like <gfid:9d83ace4-e904-4868-be69-0032e05acc5f>, however a few are filenames, so I'll start with those. All of the non-gfid entries are in the logs.bak2 directory. Here's one that shows up listed for both host_name20 and host_name22:

host_name20# stat logs.bak2/TW8YX11205001/J184091T50529848
  File: ‘logs.bak2/TW8YX11205001/J184091T50529848’
  Size: 27        	Blocks: 0          IO Block: 4096   directory
Device: fd00h/64768d	Inode: 4093747016  Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-08 19:12:09.229701706 +0000
Modify: 2021-11-11 19:30:30.313584109 +0000
Change: 2022-02-08 19:12:09.228701706 +0000
 Birth: -
host_name20# getfattr -m. -e hex -d logs.bak2/TW8YX11205001/J184091T50529848
# file: logs.bak2/TW8YX11205001/J184091T50529848
trusted.afr.vol_name-client-0=0x000000000000000000000000
trusted.afr.vol_name-client-1=0x000000000000000000000000
trusted.gfid=0x321547d38a7642c4a5fc266aed9bec50
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000618d6f560000000012b0e9ed00000000618d6f560000000012b0e9ed00000000609d432b000000003a349cf4
host_name21# stat logs.bak2/TW8YX11205001/J184091T50529848
  File: ‘logs.bak2/TW8YX11205001/J184091T50529848’
  Size: 27        	Blocks: 0          IO Block: 4096   directory
Device: fd00h/64768d	Inode: 38842218    Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-05-13 15:18:03.976526580 +0000
Modify: 2021-11-11 19:30:30.313584109 +0000
Change: 2022-02-08 19:12:09.228041551 +0000
 Birth: -
host_name21# getfattr -m. -e hex -d logs.bak2/TW8YX11205001/J184091T50529848
# file: logs.bak2/TW8YX11205001/J184091T50529848
trusted.afr.vol_name-client-1=0x000000000000000000000000
trusted.gfid=0x321547d38a7642c4a5fc266aed9bec50
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000618d6f560000000012b0e9ed00000000618d6f560000000012b0e9ed00000000609d432b000000003a349cf4
host_name22# stat logs.bak2/TW8YX11205001/J184091T50529848
  File: ‘logs.bak2/TW8YX11205001/J184091T50529848’
  Size: 27        	Blocks: 0          IO Block: 4096   directory
Device: fd00h/64768d	Inode: 4129139457  Links: 2
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2021-05-13 15:18:03.976526580 +0000
Modify: 2021-11-11 19:30:30.313584109 +0000
Change: 2022-02-08 19:12:09.227005682 +0000
 Birth: -
host_name22# getfattr -m. -e hex -d logs.bak2/TW8YX11205001/J184091T50529848
# file: logs.bak2/TW8YX11205001/J184091T50529848
trusted.gfid=0x321547d38a7642c4a5fc266aed9bec50
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000618d6f560000000012b0e9ed00000000618d6f560000000012b0e9ed00000000609d432b000000003a349cf4

The data on all 3 bricks seems consistent. Since this is a directory, can you check whether the number of entries inside it, and their names, are the same on all three bricks?

I am fine deleting the whole logs.bak2 directory to get rid of these entries pending heal. Would deleting the directory on the mounted volume be effective in solving the issue for these files?

Yes, that should work. If there's some inconsistency inside the directory itself it may not be possible to completely remove the entire directory at once, but only a few problematic entries should remain, which can be easily handled.

However, you could also use this case to "experiment" with healing files if you want.

xhernandez avatar Feb 09 '22 11:02 xhernandez

The data on all 3 bricks seems consistent. Since this is a directory, can you check whether the number of entries inside it, and their names, are the same on all three bricks?

It is not:

[hostname21 logs.bak2]$ ls|wc -l
340803
[hostname21 logs.bak2]$ find . | wc -l
18465612
[hostname22 logs.bak2]$ ls|wc -l
340803
[hostname22 logs.bak2]$ find . | wc -l
18462649

However, you could also use this case to "experiment" with healing files if you want.

I think it may be challenging to select files to heal from this directory, since the files in need of repair in it keep changing. For example, the file that I mentioned last time no longer appears in the output of gluster vol heal vol_name info. Could it be in an unstable state where Gluster could be creating and deleting files in this directory to try to align the bricks but the changes don't converge?

The other possibility is that there is a process actively writing to this directory, but I am not aware of any such process and cannot locate it. I've tried looking through the list of processes and listing the open files through lsof, which only shows gluster processes accessing the directory.
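
For reference, the lsof check was along these lines (the exact invocation is approximate):

# +D scans the directory tree recursively, which is slow on a tree this large
lsof +D /shared/vol_name/logs.bak2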

dalwise avatar Feb 11 '22 06:02 dalwise

However, you could also use this case to "experiment" with healing files if you want.

I think it may be challenging to select files to heal from this directory, since the files in need of repair in it keep changing. For example, the file that I mentioned last time no longer appears in the output of gluster vol heal vol_name info. Could it be in an unstable state where Gluster could be creating and deleting files in this directory to try to align the bricks but the changes don't converge?

That shouldn't happen if there isn't anyone accessing the volume (once a file is healed, nothing else inside Gluster touches it). Could it be that the heal full you triggered 4 days ago is still running and finding (and healing) files that were broken but not tracked internally (maybe because of the inode exhaustion that happened some time ago)? You can check the glustershd.log on each node to see if there's ongoing activity.

xhernandez avatar Feb 11 '22 10:02 xhernandez

You can check the glustershd.log on each node to see if there's ongoing activity.

There continues to be ongoing activity. It may indeed be due to the full heal that I triggered, since the count of files in need of heal is not changing much, so I'm guessing it's just new files that keep being detected in need of heal. For example:

[hostname20 glusterfs]$ date; tail glustershd.log
Sat Feb 12 02:06:10 UTC 2022
[2022-02-12 02:06:09.018375] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 3-vol_name-replicate-0: Completed metadata selfheal on 51b71c79-7088-4521-b8a3-afcd5fd275e8. sources=[2]  sinks=0 1
[2022-02-12 02:06:09.020454] I [MSGID: 108026] [afr-self-heal-entry.c:916:afr_selfheal_entry_do] 3-vol_name-replicate-0: performing entry selfheal on 51b71c79-7088-4521-b8a3-afcd5fd275e8
[2022-02-12 02:06:09.044517] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 3-vol_name-replicate-0: Completed entry selfheal on 51b71c79-7088-4521-b8a3-afcd5fd275e8. sources=[2]  sinks=0 1
[2022-02-12 02:06:09.068995] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 3-vol_name-replicate-0: performing metadata selfheal on e1349b59-fe6b-4361-92ec-e6a9e3e7aeec
[2022-02-12 02:06:09.074776] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 3-vol_name-replicate-0: Completed metadata selfheal on e1349b59-fe6b-4361-92ec-e6a9e3e7aeec. sources=[2]  sinks=0 1
[2022-02-12 02:06:09.076833] I [MSGID: 108026] [afr-self-heal-entry.c:916:afr_selfheal_entry_do] 3-vol_name-replicate-0: performing entry selfheal on e1349b59-fe6b-4361-92ec-e6a9e3e7aeec
[2022-02-12 02:06:09.096778] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 3-vol_name-replicate-0: Completed entry selfheal on e1349b59-fe6b-4361-92ec-e6a9e3e7aeec. sources=[2]  sinks=0 1
[2022-02-12 02:06:09.109857] I [MSGID: 108026] [afr-self-heal-metadata.c:52:__afr_selfheal_metadata_do] 3-vol_name-replicate-0: performing metadata selfheal on 08376a6d-823c-4750-8f33-2a5782dfd84c
[2022-02-12 02:06:09.115797] I [MSGID: 108026] [afr-self-heal-common.c:1744:afr_log_selfheal] 3-vol_name-replicate-0: Completed metadata selfheal on 08376a6d-823c-4750-8f33-2a5782dfd84c. sources=[2]  sinks=0 1
[2022-02-12 02:06:09.117873] I [MSGID: 108026] [afr-self-heal-entry.c:916:afr_selfheal_entry_do] 3-vol_name-replicate-0: performing entry selfheal on 08376a6d-823c-4750-8f33-2a5782dfd84c

I am fine deleting the whole logs.bak2 directory to get rid of these entries pending heal. Would deleting the directory on the mounted volume be effective in solving the issue for these files?

Yes, that should work. If there's some inconsistency inside the directory itself, it may not be possible to remove the entire directory at once, but only a few problematic entries should remain, and those can be handled easily.

All of the non-gfid entries in gluster vol heal vol_name info continue to be from the logs.bak2 directory, which I don't really need. Since the entries are changing all the time I don't think they will be useful for practicing file recovery, so I'm going to delete this directory on the mounted filesystem and see what remains to be fixed.

dalwise avatar Feb 12 '22 02:02 dalwise

Hi @xhernandez, I'm just providing a quick update.

Removal of the logs.bak2 directory has been ongoing for the last few days. Most top-level directories within it were removed on the first run (which took a couple of days), but, as you had warned, a number of them (~7000) could not be removed. The rm -rf command showed errors like:

rm: cannot remove ‘logs.bak2/TW9YX12003182/J198476T56175213’: Transport endpoint is not connected

I have continued re-running rm -rf logs.bak2 from within the mounted volume and I'm now down to 1200 top level directories.

At the same time the heal-count has been going down, which suggests that many of the problematic files were in this directory. The exact count changes quickly, but it is now in the hundreds rather than the 2475 I had before starting to delete logs.bak2.
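(For anyone tracking this, the count can simply be re-checked periodically with the heal-count command shown earlier, e.g.:)

watch -n 300 'gluster volume heal vol_name statistics heal-count'   # re-run the pending-heal count every 5 minutes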

dalwise avatar Feb 14 '22 20:02 dalwise

Hi @dalwise. Sorry, I couldn't answer earlier.

I would say that as self-heal is progressing, more directories can be removed. If all damaged files and directories are inside logs.bak2 and self-heal keeps progressing, you should be able to delete the remaining files soon. Let me know if something gets stuck.

Removal of the logs.bak2 directory has been ongoing for the last few days. Most top-level directories within it were removed on the first run (which took a couple of days), but, as you had warned, a number of them (~7000) could not be removed. The rm -rf command showed errors like:

rm: cannot remove ‘logs.bak2/TW9YX12003182/J198476T56175213’: Transport endpoint is not connected

Does this error persist, or was it possible to remove it after some time?

xhernandez avatar Feb 15 '22 11:02 xhernandez

Hi @xhernandez,

The Transport endpoint is not connected error prevented me from deleting all entries within logs.bak2. I ended up running the following within the mounted directory. It reduced the number of entries very slowly, but never got below 11:

while ls -ld logs.bak2; do date; rm -rf logs.bak2; done

A few hours after stopping that command the entries in this directory of the mounted volume have increased again:

# ls logs.bak2/ | wc -l
4828

One thing I have noticed is that the number of entries in the bricks on hostname21 and hostname22 matches that of the mounted volume. However, the brick on hostname20 has many more entries:

# ls /shared/.brick/logs.bak2/ | wc -l
52105

Could Gluster be restoring entries from there? glustershd.log on hostname20 shows ongoing activity, and heal-count shows hundreds of entries on hostname20 (none on the others).

I am currently running the rm -rf loop again, but it seems that entries are still being created in the directory. I still haven't found any process outside of Gluster that could be doing so.
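(As a rough way to see which entries exist only on the hostname20 brick, a diff of the sorted brick listings could help — this assumes ssh access between the nodes:)

ls /shared/.brick/logs.bak2/ | sort > /tmp/brick20.list
ssh hostname21 'ls /shared/.brick/logs.bak2/ | sort' > /tmp/brick21.list
diff /tmp/brick20.list /tmp/brick21.list | head   # entries prefixed with '<' exist only on the hostname20 brick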

dalwise avatar Feb 16 '22 05:02 dalwise

Hi @xhernandez,

The Transport endpoint is not connected error prevented me from deleting all entries within logs.bak2. I ended up running the following within the mounted directory. It reduced the number of entries very slowly, but never got below 11:

while ls -ld logs.bak2; do date; rm -rf logs.bak2; done

A few hours after stopping that command the entries in this directory of the mounted volume have increased again:

# ls logs.bak2/ | wc -l
4828

That's weird. Once deleted, even self-heal shouldn't create the directories again.

Can you run gluster volume status <volname> clients and gluster volume status <volname> client-list? They will show who is connected to the bricks. Only the self-heal daemon should appear. If so, you may try to kill the self-heal process, remove everything you can, and start it again using gluster volume start <volname> force.
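A rough outline of that sequence (volume name and paths as used earlier in this issue; the exact way of stopping glustershd varies between versions, so treat this as a sketch rather than a verified procedure):

gluster volume status vol_name clients       # check who is connected to each brick
gluster volume status vol_name client-list   # per-type summary of connected clients
pkill -f glustershd                          # stop the self-heal daemon on this node (repeat on each node); its PID is also listed as "Self-heal Daemon" in gluster volume status
rm -rf /shared/vol_name/logs.bak2            # remove what you can through the mount while self-heal is stopped
gluster volume start vol_name force          # restart the missing daemons (including self-heal) without touching the bricks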

One thing I have noticed is that the number of entries in the bricks on hostname21 and hostname22 matches that of the mounted volume. However, the brick on hostname20 has many more entries:

# ls /shared/.brick/logs.bak2/ | wc -l
52105

Could Gluster be restoring entries from there? glustershd.log on hostname20 shows ongoing activity, and heal-count shows hundreds of entries on hostname20 (none on the others).

It's the only option I see. Though if self-heal considers the contents of that brick good enough to replicate to the other bricks, those contents should also appear when you list the directory through the mount point. Can you provide the output of the following commands from all bricks?

# stat /shared/.brick/logs.bak2
# getfattr -m. -e hex -d /shared/.brick/logs.bak2

@karthik-us @pranithk do you know why self-heal may be recovering data from a brick that is not used to show the contents from the mount point?

xhernandez avatar Feb 16 '22 11:02 xhernandez

Thanks for your help @xhernandez.

Before reading your message I kept trying to remove directories, but their number has nonetheless increased again. Currently all bricks have the same number of entries. Trying to delete them from within the mounted filesystem results in many Transport endpoint is not connected errors:

[hostname20 vol_name]$ ls logs.bak2/|wc -l
45568
[hostname20 ]$ ls /shared/.brick/logs.bak2/ | wc -l
45568
[hostname21 ~]$ ls /shared/.brick/logs.bak2/ | wc -l
45568
[hostname22 ~]$ ls /shared/.brick/logs.bak2/ | wc -l
45568

Can you run gluster volume status <volname> clients and gluster volume status <volname> client-list? They will show who is connected to the bricks. Only the self-heal daemon should appear. If so, you may try to kill the self-heal process, remove everything you can, and start it again using gluster volume start <volname> force.

Besides the self-heal daemons, the fuse mount on each of the 3 nodes is also shown. hostname20 also had an additional fuse mount on another mount point, which I have unmounted, but it didn't seem to be in use. I assume that I'll need a fuse mount (at least on one node) to be able to rm the directories. Or did you mean to remove them directly on the bricks? Is the start command only meant to restart self-heal, or did you intend me to run gluster volume stop <volname> force at some earlier point?
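(For reference, the gluster fuse mounts present on a node can be listed with either of the following:)

grep fuse.glusterfs /proc/mounts   # list all gluster fuse mounts on this node
mount -t fuse.glusterfs            # equivalent check using mount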

[hostname20 ~]$ gluster volume status vol_name clients
Client connections for volume vol_name
----------------------------------------------
Brick : hostname21:/shared/.brick
Clients connected : 6
Hostname                                               BytesRead    BytesWritten       OpVersion
--------                                               ---------    ------------       ---------
10:32::21:49151                                     212263659412    248686626360           70200
10:32::22:49151                                     198986307580    205689698096           70200
10:32::1:49137                                      386254711468    428001593072           70200
10:32::1:49129                                      244098818470   1620181422056           70200
10:32::21:49145                                      12953535376     28427954548           70200
10:32::22:49144                                            67992           67436           70200
----------------------------------------------
Brick : hostname22:/shared/.brick
Clients connected : 6
Hostname                                               BytesRead    BytesWritten       OpVersion
--------                                               ---------    ------------       ---------
10:32::21:49143                                     195517413400    200070163428           70200
10:32::22:49150                                     215749549792    251727554036           70200
10:32::1:49141                                      386439663160    426006159192           70200
10:32::1:49127                                      236743313241    730263668496           70200
10:32::21:49142                                      63979330377    142047192344           70200
10:32::22:49141                                            67992           66864           70200
----------------------------------------------
Brick : hostname20:/shared/.brick
Clients connected : 6
Hostname                                               BytesRead    BytesWritten       OpVersion
--------                                               ---------    ------------       ---------
10:32::20:49141                                     348078654668    407153848100           70200
10:32::21:49144                                      69623738472     64403543140           70200
10:32::21:49141                                      84327676813     71605132816           70200
10:32::22:49143                                      69862449248     65162147332           70200
10:32::20:49133                                        111578892        96524416           70200
10:32::22:49140                                            65128           58184           70200
----------------------------------------------

[hostname20 ~]$ gluster volume status vol_name client-list
Client connections for volume vol_name
Name                                                       count
-----                                                     ------
glustershd                                                     3
fuse                                                           3

total clients for volume vol_name : 6
-----------------------------------------------------------------

[hostname20 ~]$

Can you provide the output of the following commands from all bricks ?

[hostname20 ~]$ stat /shared/.brick/logs.bak2
  File: ‘/shared/.brick/logs.bak2’
  Size: 11071488  	Blocks: 4168       IO Block: 4096   directory
Device: fd00h/64768d	Inode: 1879049437  Links: 45570
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-16 21:56:56.290330204 +0000
Modify: 2022-02-16 13:19:24.070714399 +0000
Change: 2022-02-16 13:35:35.601728222 +0000
 Birth: -
[hostname20 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs.bak2
trusted.afr.dirty=0x000000000000000000000000
trusted.afr.vol_name-client-0=0x000000000000000000000000
trusted.afr.vol_name-client-1=0x000000000000000000000000
trusted.gfid=0x1ce9de988117431082a26b27b7d33a83
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000620cf9dc000000000433946000000000620cf9dc0000000004339460000000005e017d52000000001f6cc3e4
[hostname21 ~]$ stat /shared/.brick/logs.bak2
  File: ‘/shared/.brick/logs.bak2’
  Size: 1482752   	Blocks: 4024       IO Block: 4096   directory
Device: fd00h/64768d	Inode: 4913161408  Links: 45570
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-16 15:32:40.938630891 +0000
Modify: 2022-02-16 13:19:24.070489184 +0000
Change: 2022-02-16 13:35:35.599780013 +0000
 Birth: -
[hostname21 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs.bak2
trusted.gfid=0x1ce9de988117431082a26b27b7d33a83
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000620cf9dc000000000433946000000000620cf9dc0000000004339460000000005e017d52000000001f6cc3e4
[hostname22 ~]$ stat /shared/.brick/logs.bak2
  File: ‘/shared/.brick/logs.bak2’
  Size: 1482752   	Blocks: 4024       IO Block: 4096   directory
Device: fd00h/64768d	Inode: 805307078   Links: 45570
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2022-02-16 15:33:15.305998830 +0000
Modify: 2022-02-16 13:19:24.070489184 +0000
Change: 2022-02-16 13:35:35.599523053 +0000
 Birth: -
[hostname22 ~]$ getfattr -m. -e hex -d /shared/.brick/logs.bak2
getfattr: Removing leading '/' from absolute path names
# file: shared/.brick/logs.bak2
trusted.gfid=0x1ce9de988117431082a26b27b7d33a83
trusted.glusterfs.dht=0x000000000000000000000000ffffffff
trusted.glusterfs.mdata=0x01000000000000000000000000620cf9dc000000000433946000000000620cf9dc0000000004339460000000005e017d52000000001f6cc3e4

dalwise avatar Feb 16 '22 22:02 dalwise