lockfile icon indicating copy to clipboard operation
lockfile copied to clipboard

Suggest retry if lockfile disappears

Open bboreham opened this issue 3 years ago • 6 comments

I saw very occasional failures like open /tmp/ignite-snapshot.lock: no such file or directory", and suspect it could be here.

Seems possible if the lock file was present when we do the Lstat(), but has been removed by the time we get to reading it.

I considered re-doing the switch so this would be a case at top level, but went for minimal lines changed. Let me know if you prefer something else.

bboreham avatar Jan 14 '21 16:01 bboreham

If the lockfile disappears and we cannot find the process owning it between existance and disappearance, the owner is probably not in the same pid space of the test runner or on a remote network file. So that use case of the lockfile package is not supported, since you cannot identify the PID of the owner. @bboreham can you confirm this theory?

nightlyone avatar Feb 04 '21 22:02 nightlyone

I didn’t understand “cannot find the process owning it”.

What if the process has exited?

bboreham avatar Feb 05 '21 19:02 bboreham

We (@symflower) have the same problem: sometimes "no such file or directory" is returned as an error for LockFile.TryLock. The change provided by @bboreham fixes that (THANKS!).

Here is a reproducer which reproduces the problem for me in <1min, usually in a few seconds. With the change the problem is gone. I know it does not make sense to acquire a lock with this package like that, but it reproduces the problem and that was what i am aiming for.

package main

import (
	"fmt"
	"math/rand"
	"os"
	"sync"
	"time"

	"github.com/nightlyone/lockfile"
)

func main() {
	path, err := os.MkdirTemp("", "provoke-go-lockfile")
	if err != nil {
		panic(err)
	}

	var wg sync.WaitGroup
	wg.Add(1)

	for i := 0; i < 10; i++ {
		go func(i int) {
			for {
				fmt.Printf("%d: Create lock\n", i)

				lock, err := lockfile.New(path + "/you-better.lock")
				if err != nil {
					panic(err)
				}

				fmt.Printf("%d: Try to lock\n", i)

				for {
					err = lock.TryLock()
					if err == nil {
						break
					}

					if err != lockfile.ErrBusy && err != lockfile.ErrNotExist {
						panic(err)
					}

					time.Sleep(time.Millisecond)
				}

				fmt.Printf("%d: LOCKED\n", i)

				time.Sleep(time.Duration(rand.Int63n(2)+1) * time.Second)

				fmt.Printf("%d: Let's unlock\n", i)

				lock.Unlock()
			}
		}(i)
	}

	wg.Wait()
}

zimmski avatar Apr 27 '21 19:04 zimmski

I ran into this bug as well...it's pretty easy to reproduce with many processes spamming for the lock.

this is just a race condition in your code, assuming a process p1 already has a lock, if a 2nd process p2 is trying to grab a lock and it successfully reaches the ownership check...that is to say it failed to create a link to it's own pid file, and it verified that the existing lockfile exists, between the call to os.SameFile(fiTmp, fiLock) and the read in GetOwner the lockfile can be deleted by p1.

chancila avatar Jun 18 '21 02:06 chancila

@nightlyone could you take another look at the change and the reproducer I posted. This is still a problem for us.

zimmski avatar Jan 13 '23 08:01 zimmski

hey @nightlyone! we're facing the same issue while trying multiple VM creations in parallel. Would be great if you can review this PR once

vistaarjuneja avatar Feb 24 '23 10:02 vistaarjuneja