[bug:1718562] flock failure (regression)
URL: https://bugzilla.redhat.com/1718562 Creator: jaco at uls.co.za Time: 20190608T16:34:10
Description of problem:
After a small number of flock rounds the lock remains behind indefinitely until cleared with "gluster volume clear-locks", after which normal operation resumes.
I suspect this happens when there is contention on the lock.
I've got a setup where these locks are used as a synchronization mechanism. A process on host A will take the lock and release it on shutdown, at which point another host is likely already trying to obtain the lock, and it never manages to do so (clearing the granted lock allows it to proceed, but randomly clearing locks is a high-risk operation).
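For illustration, the usage pattern is roughly the following sketch (the path and workload command are placeholders, not the actual control scripts):

#!/bin/bash
# Illustrative sketch of the cross-host usage pattern: serialise a run on a
# shared lock file that lives on the replicated gluster mount.
LOCKFILE=/mnt/t/lockfile        # assumed path on the replicated volume

exec 3>"${LOCKFILE}" || exit 1
if ! flock -w 600 -x 3; then    # the peer host may still be holding the lock
    echo "could not obtain lock" >&2
    exit 1
fi
run-workload                    # placeholder for the real long-running work
exec 3<&-                       # closing the fd releases the lock on shutdown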
Version-Release number of selected component (if applicable): glusterfs 6.1 (confirmed working correctly on 3.12.3 and 4.0.2, suspected correct on 4.1.5 but no longer have a setup with 4.1.5 around).
How reproducible: Trivially. In the mentioned application it happens on almost every single lock attempt as far as I can determine.
Steps to Reproduce:
morpheus ~ # gluster volume info shared
Volume Name: shared
Type: Replicate
Volume ID: a4410662-b6e0-4ed0-b1e0-a1cbf168029c
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: morpheus:/mnt/gluster/shared
Brick2: r2d2:/mnt/gluster/shared
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
morpheus ~ # mkdir /mnt/t
morpheus ~ # mount -t glusterfs localhost:shared /mnt/t
morpheus ~ #
r2d2 ~ # mkdir /mnt/t
r2d2 ~ # mount -t glusterfs localhost:shared /mnt/t
r2d2 ~ #
morpheus ~ # cd /mnt/t/
morpheus ~ # ls -l
total 0
morpheus /mnt/t # exec 3>lockfile; c=0; while flock -w 10 -x 3; do (( c++ )); echo "Iteration $c passed"; exec 3<&-; exec 3>lockfile; done; echo "Failed after $c iterations"; exec 3<&-
Iteration 1 passed
Iteration 2 passed
Iteration 3 passed
...
r2d2 /mnt/t # exec 3>lockfile; c=0; while flock -w 10 -x 3; do (( c++ )); echo "Iteration $c passed"; exec 3<&-; exec 3>lockfile; done; echo "Failed after $c iterations"; exec 3<&-
Iteration 1 passed
Iteration 2 passed
Failed after 2 iterations
r2d2 /mnt/t #
Iteration 100 passed
Iteration 101 passed
Iteration 102 passed
Failed after 102 iterations
morpheus /mnt/t #
The two mounts failed at the same time, morpheus just passed more iterations due to being started first.
Iterating on only one host I had to stop it with ^C at around 10k iterations, which to me is sufficient indication that it's contention related.
After the above failure I need to either rm the file (after which it works again), or issue "gluster volume clear-locks shared /lockfile kind granted posix".
On /tmp on my local machine (ext4 filesystem) I can run as many invocations of the loop above as I want without issues.
On glusterfs 3.12.3 and 4.0.2 I tried the above too, and stopped them after 10k iterations.
I have not observed the behaviour on glusterfs 4.1.5 which we used for a very long time.
I either need a fix for this, a way (preferably with little to no downtime; around 1.8 TB of data in total) to downgrade glusterfs back to 4.1.x, or a way to work around this reliably from within my application code (mostly control scripts written in bash).
Time: 20190610T09:00:30 atumball at redhat commented: Hi Jaco, thanks for the report. Will update on this soon.
Time: 20190610T09:18:12 ykaul at redhat commented: Looks like a simple test worth adding to our CI. We can do with 1000 iterations or so.
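A minimal sketch of what such a regression test could look like (the mount point, volume setup and script are assumptions, not an existing test in the suite):

#!/bin/bash
# Hypothetical regression test: repeatedly take and drop an exclusive flock
# on a file on a gluster mount and fail if any attempt times out.
MOUNT=${1:-/mnt/t}              # assumed gluster mount point
ITERATIONS=${2:-1000}

cd "${MOUNT}" || exit 1
for ((i = 1; i <= ITERATIONS; i++)); do
    exec 3>lockfile
    if ! flock -w 10 -x 3; then
        echo "FAIL: flock timed out on iteration $i" >&2
        exit 1
    fi
    exec 3<&-                   # closing the fd releases the lock
done
echo "PASS: ${ITERATIONS} iterations"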
Time: 20190619T07:11:48 spalai at redhat commented: The issue is always reproducible. Will update once I find the RCA.
Susant
Time: 20190619T09:09:56 jaco at uls.co.za commented: I've managed to implement a workaround for this in php/bash (C/C++ will be similar).
This "work around" is perhaps how locking should have been implemented in the first place on our end (lock files gets removed post use).
The code uses a small(ish 1s) timeout per flock() call due to the bug, a more global timeout would be better but given the bug here doesn't work as well as can be done. Recursion can (and should) be eliminated but I haven't spent a lot of time on this (getting it out the door was more urgent than making it optimal). This code does have the single advantage that lock files gets removed post use again (it's based on discussions with other parties).
The other option for folks running into this is to look at dotlockfile(1), which doesn't rely on flock() but has other major timing-gap issues (retries are atomic, but waiting is a simple sleep + retry, so if other processes grab locks at the wrong time the invoking process could starve/fail unnecessarily).
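For reference, a dotlockfile(1) based variant would look roughly like the following (path and retry count are illustrative, not from the actual scripts):

#!/bin/bash
# Illustrative dotlockfile(1) usage: the lock file is created atomically, but
# waiting is done by sleeping and retrying, so a busy lock can starve the caller.
LOCKFILE=/mnt/t/lockfile.lock            # assumed path on the gluster mount

if ! dotlockfile -r 10 "${LOCKFILE}"; then   # -r: retries before giving up
    echo "could not obtain lock" >&2
    exit 1
fi
# ... critical section ...
dotlockfile -u "${LOCKFILE}"                 # remove the lock file again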
Bash:
#!/bin/bash

function getlock() {
    local fd="$1"
    local lockfile="$2"
    local waittime="$3"

    # (Re)open the lock file on the requested fd and remember its inode.
    eval "exec $fd>\"\${lockfile}\"" || return $?
    local inum=$(stat -c%i - <&"$fd")

    local lwait="-w1"
    [ "${waittime}" -le 0 ] && lwait=-n

    while ! flock ${lwait} -x "$fd"; do
        # If the file was removed/replaced underneath us, reopen it and retry.
        if [ "$(stat -c%i "${lockfile}" 2>/dev/null)" != "${inum}" ]; then
            eval "exec $fd>\"\${lockfile}\"" || return $?
            inum=$(stat -c%i - <&"$fd")
            continue
        fi
        (( waittime-- ))
        if [ "${waittime}" -le 0 ]; then
            eval "exec $fd<&-"
            return 1
        fi
    done

    # Re-check that the file we locked is still the one on disk.
    if [ "$(stat -c%i "${lockfile}" 2>/dev/null)" != "${inum}" ]; then
        eval "exec $fd<&-"
        getlock "$fd" "$lockfile" "${waittime}"
        return $?
    fi
    return 0
}

function releaselock() {
    local fd="$1"
    local lockfile="$2"

    rm "${lockfile}"
    eval "exec $fd<&-"
}
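For reference, a usage sketch of the above helpers (the fd number, path and timeout are illustrative):

# Take the lock on fd 3 with a 30 second timeout, do the work, release it.
if getlock 3 /mnt/t/lockfile 30; then
    # ... critical section ...
    releaselock 3 /mnt/t/lockfile
else
    echo "could not obtain lock" >&2
fi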
PHP:
<?php
function getlock($filename, $lockwait)
{
    $lock = new stdClass(); /* reconstructed: the object setup was mangled in the original paste */
    $lock->filename = $filename;
    $lock->fp = fopen($filename, "w");
    if (!$lock->fp)
        return NULL;
    $lstat = fstat($lock->fp);
    if (!$lstat) {
        fclose($lock->fp);
        return NULL;
    }
    pcntl_signal(SIGALRM, function() {}, false);
    pcntl_alarm(1);
    while (!flock($lock->fp, LOCK_EX)) {
        pcntl_alarm(0);
        clearstatcache(true, $filename);
        $nstat = stat($filename);
        if (!$nstat || $nstat['ino'] != $lstat['ino']) {
            fclose($lock->fp);
            $lock->fp = fopen($filename, "w");
            if (!$lock->fp)
                return NULL;
            $lstat = fstat($lock->fp);
            if (!$lstat) {
                fclose($lock->fp);
                return NULL;
            }
        }
        if (--$lockwait <= 0) { /* reconstructed: this comparison was eaten in the original paste */
            fclose($lock->fp);
            return NULL;
        }
        pcntl_alarm(1);
    }
    pcntl_alarm(0);
    clearstatcache(true, $filename);
    $nstat = stat($filename);
    if (!$nstat || $nstat['ino'] != $lstat['ino']) {
        fclose($lock->fp);
        return getlock($filename, $lockwait);
    }
    return $lock;
}

function releaselock($lock)
{
    unlink($lock->filename);
    fclose($lock->fp);
}
?>
Time: 20190917T21:14:18 jaco at uls.co.za commented: Any progress here? Whilst the workaround is working, it ends up leaving a lot of garbage around since the original file really isn't getting unlocked, ending up consuming an inode on at least one of the bricks and eventually (on smaller systems) resulting in out-of-inode situations on the filesystem.
Time: 20190918T06:04:43 vpandey at redhat commented: @Jaco Kroon, I will assign this to someone to resolve this issue.
@Susant Can you look into this issue? It's been stale for a long time now.
Time: 20190918T06:25:25 spalai at redhat commented: Moving this to lock translator maintainer.
Time: 20190918T06:26:20 spalai at redhat commented: Kruthika, could you help with this issue?
Hi All,
Just to add to this. Even with the work-arounds deployed, this leaves files behind on the brick. This seems to only affect some of the bricks though (df -i counts differ between the bricks; find .glusterfs -type f -links 1 finds a LOT of files).
Fortunately, where this bites me most, reformatting a brick every now and again is not a major issue. They're only around 2 GB and intended specifically for these lock files. In other environments it's not as trivial, and if I need to reformat one of those bricks ... well, let's just say a current rebuild on one of those is at four weeks and, based on disk space consumed, isn't even at 10% yet.
Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.
Closing this issue as there was no update since my last update on issue. If this is an issue which is still valid, feel free to open it.
Still present in 7.8, so definitely not resolved. Simply ignoring a bug for more than six months does not make it go away.
Re-opening since it has a detailed reproducer. If anyone is working on this, please assign the issue to yourself.
Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.
The update to RHGS 3.5 did the trick - thank you!
~james
Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.
I can no longer reproduce the lock failures; however, on glusterfs 8.6 at least there are still files left behind in .glusterfs/unlink/ for repeatedly used lockfiles. This happens only with the open, lock, remove, unlock sequence, i.e., something like (ignoring most error conditions):
fd = open("lockfile", O_RDWR | O_CREAT, 0600);
if (flock(fd, LOCK_EX) == 0 && stat("lockfile", &st1) == 0 && fstat(fd, &st2) == 0 && st1.st_ino == st2.st_ino) {
    // lock succeeded, do work.
    unlink("lockfile");
}
close(fd);
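The same sequence can be approximated from the shell for testing; a sketch (path illustrative, and without the inode re-check the C version does):

# Roughly the same open/lock/remove/unlock sequence using flock(1):
exec 3>/mnt/t/lockfile          # open (creates the file)
if flock -x 3; then             # lock
    # ... do work ...
    rm -f /mnt/t/lockfile       # remove while still holding the lock
fi
exec 3<&-                       # unlock by closing the fd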
Then, after a few iterations of doing the above on a glusterfs filesystem (we've got a locked cron job running at 7:00 in the mornings on two hosts; runs are typically in the order of minutes, but both start at 7:00 and both sides have to run, so the lock timeout is half an hour), one starts seeing this:
# ls -lahtr .glusterfs/unlink/ | tail
-rw-r--r-- 1 root root 0 Jan 11 07:00 b04c8c63-be55-4126-8431-7fd084401913
-rw-r--r-- 1 root root 0 Jan 12 07:00 41bb8e2d-98c4-4104-9620-0b361b890a9e
-rw-r--r-- 1 root root 0 Jan 12 07:00 891eb0c1-97d2-4d6e-9058-80b494e4d424
-rw-r--r-- 1 root root 0 Jan 13 07:00 438bec3a-6049-42aa-a46f-4a3dcedbb7ba
-rw-r--r-- 1 root root 0 Jan 13 07:00 d63293a4-50a5-47b3-af1c-41121685a672
-rw-r--r-- 1 root root 0 Jan 14 07:00 d711d696-ed2f-4ac3-b407-78ce2b2a001f
-rw-r--r-- 1 root root 0 Jan 14 07:00 b0749f5c-4ed6-4cfe-8f70-ede3f9058063
-rw-r--r-- 1 root root 0 Jan 15 07:00 446695e5-24ff-4600-8058-6c679b7b16a1
-rw-r--r-- 1 root root 0 Jan 15 07:00 743cb5b1-962a-47cd-85e2-d14fd173484b
So basically the lockfiles are now left behind on one or both of the bricks in the unlink/ folder. There are two ways to clean this up:
- replace-brick (which is quite an iterative process)
- rm .glusterfs/unlink/*
The former is probably more correct but also more time consuming; the latter is very indiscriminate and could remove links that are actually still required. We do ensure that gluster volume heal ... info at least doesn't show any heals that need to happen before doing that, but still.
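A sketch of that more cautious cleanup (volume name, brick path and age threshold are assumptions, not an official procedure):

#!/bin/bash
# Only clean the unlink/ directory on a brick if the volume reports nothing
# pending heal, and restrict removal to entries older than a day.
VOLUME=shared                             # assumed volume name
BRICK=/mnt/gluster/shared                 # assumed brick path

if gluster volume heal "${VOLUME}" info | grep -q "Number of entries: [1-9]"; then
    echo "heals pending, not touching ${BRICK}/.glusterfs/unlink" >&2
    exit 1
fi
find "${BRICK}/.glusterfs/unlink" -type f -mtime +1 -delete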
I've upgraded one of the affected pairs now to glusterfs 10.0 and will advise if the problem persists.
I rechecked; I believe there are sufficient grounds for filing a separate bug, as this does seem to have been solved but introduced some new problems, as set out in https://github.com/gluster/glusterfs/issues/3153
Thank you for your contributions. Noticed that this issue is not having any activity in last ~6 months! We are marking this issue as stale because it has not had recent activity. It will be closed in 2 weeks if no one responds with a comment here.
Closing this issue as there was no update since my last update on issue. If this is an issue which is still valid, feel free to open it.