go sync: inconsistent mutex state

My code face "sync: inconsistent mutex state" panic, and I'm trying to fix it.

https://github.com/golang/go/blob/d54a9a9c42e751a020308cae296add426b56d0f0/src/sync/mutex.go#L147

I wonder should this line be old = atomic.LoadInt32(&m.state)?

https://github.com/golang/go/blob/d54a9a9c42e751a020308cae296add426b56d0f0/src/sync/mutex.go#L140

It seems like there will be concurrent write in line: 212, and this line cannot guarantee old value is correct.

I found https://groups.google.com/g/golang-nuts/c/Eq4CWTB6LhM/m/F0Lbgnfq4hQJ this post said it's a bug, but 7 years past, it's still there, any purpose? Or is it safe to read int value from a pointer's field?

sorry for my poor english

Sep 24 '20 08:09 UnightSun

This looks like memory corruption. Have you tried running your program under the race detector? See https://blog.golang.org/race-detector .

Sep 24 '20 08:09 davecheney

This looks like memory corruption. Have you tried running your program under the race detector? See https://blog.golang.org/race-detector .

unfortunately my code running in produce environment that I have no fully control, and It cannot be reappeared in dev environment. I wrote a little test case but no race found.

Sep 24 '20 09:09 UnightSun

Can you please provide the full panic message. Thank you

Sep 24 '20 11:09 davecheney

fatal error: sync: inconsistent mutex state

goroutine 3571787 [running]:
runtime.throw(0x4e6ccbc, 0x1e)
        /usr/local/go/src/runtime/panic.go:1116 +0x72 fp=0xc0021dd378 sp=0xc0021dd348 pc=0x439512
sync.throw(0x4e6ccbc, 0x1e)
        /usr/local/go/src/runtime/panic.go:1102 +0x35 fp=0xc0021dd398 sp=0xc0021dd378 pc=0x46cbb5
sync.(*Mutex).lockSlow(0x8313a38)
        /usr/local/go/src/sync/mutex.go:147 +0x165 fp=0xc0021dd3e0 sp=0xc0021dd398 pc=0x47d2a5
sync.(*Mutex).Lock(...)
        /usr/local/go/src/sync/mutex.go:81
sync.(*Pool).pinSlow(0xc00007ec68, 0x0, 0x0)
        /usr/local/go/src/sync/pool.go:213 +0x271 fp=0xc0021dd460 sp=0xc0021dd3e0 pc=0x47de31
sync.(*Pool).pin(0xc00007ec68, 0xc00bb1fab2, 0x4)
        /usr/local/go/src/sync/pool.go:206 +0x65 fp=0xc0021dd488 sp=0xc0021dd460 pc=0x47db85
sync.(*Pool).Get(0xc00007ec68, 0x113, 0x1)
        /usr/local/go/src/sync/pool.go:128 +0x2f fp=0xc0021dd4d0 sp=0xc0021dd488 pc=0x47d82f

then will be business logic, take something from pool.

my go version is 1.14

Sep 24 '20 12:09 UnightSun

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

Oct 28 '20 18:10 gopherbot

what's the final solution? I catch this bug again. go version go1.16.5 linux/amd64

fatal error: sync: inconsistent mutex state

goroutine 3295163 [running]:
runtime.throw(0xa5268d, 0x1e)
	/home/ubuntu/go/src/runtime/panic.go:1117 +0x72 fp=0xc000623ec8 sp=0xc000623e98 pc=0x436292
sync.throw(0xa5268d, 0x1e)
	/home/ubuntu/go/src/runtime/panic.go:1103 +0x35 fp=0xc000623ee8 sp=0xc000623ec8 pc=0x467575
sync.(*Mutex).lockSlow(0x1582030)
	/home/ubuntu/go/src/sync/mutex.go:125 +0x1fb fp=0xc000623f38 sp=0xc000623ee8 pc=0x483f5b
sync.(*Mutex).Lock(...)
	/home/ubuntu/go/src/sync/mutex.go:81
sync.(*RWMutex).Lock(0x1582030)
	/home/ubuntu/go/src/sync/rwmutex.go:111 +0x90 fp=0xc000623f60 sp=0xc000623f38 pc=0x485570
github.com/hftquanter/hftquant/dump.Update(0x405ea91d9571ecac, 0x0, 0x3fb70a3d70a3d70a, 0xc014000000000000, 0x4032a872b020c49c, 0x0, 0x40329d4fdf3b645a, 0x0)
	/home/ubuntu/gopath/src/github.com/hftquanter/hftquant/dump/dump.go:27 +0x3e fp=0xc000623fa0 sp=0xc000623f60 pc=0x7e385e
runtime.goexit()
	/home/ubuntu/go/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc000623fa8 sp=0xc000623fa0 pc=0x46bfa1

Jul 12 '21 13:07 beaquant

@beaquant can you provide a complete program that demonstrates the problem?

cc @ianlancetaylor

Jul 12 '21 17:07 networkimprov

It's very unlikely that we will be able to fix this issue if we can't reproduce it.

I strongly encourage you to find a way to run the program under the race detector, as that is the most likely cause of the problem. Nobody else is reporting this problem, so the odds are against it being a bug in the standard library. In any case, if it is a bug in the standard library, we will need a way to reproduce the problem so that we can analyze it.

The lines old = m.state in sync/mutex.go are fine. The old value is only used as an input to atomic.CompareAndSwapInt32.

Jul 13 '21 03:07 ianlancetaylor

gopherbot, please don't do that.

Jul 13 '21 03:07 ianlancetaylor

for " The lines old = m.state in sync/mutex.go are fine. The old value is only used as an input to atomic.CompareAndSwapInt32. "

It's very unlikely that we will be able to fix this issue if we can't reproduce it.

I strongly encourage you to find a way to run the program under the race detector, as that is the most likely cause of the problem. Nobody else is reporting this problem, so the odds are against it being a bug in the standard library. In any case, if it is a bug in the standard library, we will need a way to reproduce the problem so that we can analyze it.

The lines old = m.state in sync/mutex.go are fine. The old value is only used as an input to atomic.CompareAndSwapInt32.

this said The old value is only used as an input to atomic.CompareAndSwapInt32. it may ignore line 97 " if old&(mutexLocked|mutexStarving) == mutexLocked && runtime_canSpin(iter) { // Active spinning makes sense. // Try to set mutexWoken flag to inform Unlock // to not wake other blocked goroutines. if !awoke && old&mutexWoken == 0 && old>>mutexWaiterShift != 0 && atomic.CompareAndSwapInt32(&m.state, old, old|mutexWoken) { awoke = true } runtime_doSpin() iter++ old = m.state continue } "

old>>mutexWaiterShift != 0

here also use old value.

and the old of state may also change when awakened. and the number of waiter will also change.

I think direct old = m.state is not safe in a concurrent scenario，why not old = atomic.LoadInt32(&m.state)? This makes me quite confused.

Aug 02 '21 07:08 darellwu

The expression old>>mutexWaiterShift != 0 may use an older value of old. That is true. But it doesn't matter because the same condition also calls atomic.CompareAndSwapInt32(&m.state, old, old|mutexWoken). If the value of old is out of date, the call to atomic.CompareAndSwapInt32 will fail and the code will continue executing.

and the old of state may also change when awakened. and the number of waiter will also change.

I'm sorry, I don't understand how this could happen. The lockSlow method only sleeps at one point, at the call to runtime_SemacquireMutex. It immediately reloads old after it wakes up. And while it may possibly pick up an out-of-date value of m.state, it still doesn't matter.

I want to repeat that we are not seeing any bug reports about mutexes other than this one. Perhaps there is something wrong with this code, but it is not showing up as problems that people are reporting. If there is a real problem here, it should be possible to write a program that reproduces the problem.

Aug 02 '21 21:08 ianlancetaylor

The expression old>>mutexWaiterShift != 0 may use an older value of old. That is true. But it doesn't matter because the same condition also calls atomic.CompareAndSwapInt32(&m.state, old, old|mutexWoken). If the value of old is out of date, the call to atomic.CompareAndSwapInt32 will fail and the code will continue executing.

and the old of state may also change when awakened. and the number of waiter will also change.

I'm sorry, I don't understand how this could happen. The lockSlow method only sleeps at one point, at the call to runtime_SemacquireMutex. It immediately reloads old after it wakes up. And while it may possibly pick up an out-of-date value of m.state, it still doesn't matter.

I want to repeat that we are not seeing any bug reports about mutexes other than this one. Perhaps there is something wrong with this code, but it is not showing up as problems that people are reporting. If there is a real problem here, it should be possible to write a program that reproduces the problem.

After all, using a variable is for reading and writing. “ If there is a real problem here, it should be possible to write a program that reproduces the problem. ” The point of attention in the sentence you said above is completely written in this respect. I think the starting point here is correct.

But you have a lot of reading judgments that depend on this variable， for example:

            // Don't try to acquire starving mutex, new arriving goroutines must queue.
	if old&mutexStarving == 0 {
		new |= mutexLocked
	}
	if old&(mutexLocked|mutexStarving) != 0 {
		new += 1 << mutexWaiterShift
	}
	// The current goroutine switches mutex to starvation mode.
	// But if the mutex is currently unlocked, don't do the switch.
	// Unlock expects that starving mutex has waiters, which will not
	// be true in this case.
	if starving && old&mutexLocked != 0 {
		new |= mutexStarving
	}

If there are even more cases, this is an expired value，is this judgment meaningful？（Does this fit the contextual business semantics?） And, will a goroutine(suppose it can no longer spin) always fail to meet the conditions of line 129 ‘atomic.CompareAndSwapInt32(&m.state, old, new)’?

Will it be in a for loop instead of waiting in runtime_SemacquireMutex? so I still think it is dangerous to read without synchronization 'old = m.state'.

I think the reason why bugs do not often appear is because of the previous spin, and the other is that the goroutine is scheduled once every 10ms.

Thank you！

Aug 06 '21 08:08 darellwu

Changing the variable new does nothing by itself. The only way that new affects anything is if the atomic.CompareAndSwapInt32 succeeds. And that will only succeed if m.state still holds the value old, the value that caused the changes to new to occur.

Aug 06 '21 17:08 ianlancetaylor

I don't think we're getting to the same point. If the atomic.CompareAndSwapInt32 is successful, the state will become the new value. This means that the new value also becomes the old value. We need to loop and think again, and go back to what I said above.

Thanks.

Aug 07 '21 14:08 darellwu

hello, has this been resolved?

Sep 22 '22 11:09 bingzhuo2008

I also had this issue by using sync.pool

Sep 25 '22 14:09 emptyfort

I also had this issue by using sync.pool

I don't think that will help any. Note comment above:

I want to repeat that we are not seeing any bug reports about mutexes other than this one. Perhaps there is something wrong with this code, but it is not showing up as problems that people are reporting. If there is a real problem here, it should be possible to write a program that reproduces the problem.

Sep 25 '22 15:09 andig

I changed 'throw("sync: inconsistent mutex state")' to 'throw("sync: inconsistent mutex state, new="+strconv.FormatInt(new,10))' and it printed a huge number such as 2064149489, does that mean there is hundred millions of goroutines using the mutex?

Oct 17 '22 12:10 bingzhuo2008

Either that, or there is some corruption happening that is overwriting the mutex. I would bet on the latter.

Oct 17 '22 15:10 randall77

I don't think we're getting to the same point. If the atomic.CompareAndSwapInt32 is successful, the state will become the new value. This means that the new value also becomes the old value. We need to loop and think again, and go back to what I said above.

Thanks.

have you solve this problem?

Oct 31 '22 08:10 mountaincleanwater

I solved by removing some concurrent function call which may cause data race. But I don't know how this affected sync.mutex?

func GetCostTreeFromCtx(ctx context.Context) *CostTree {
	switch c := ctx.(type) {
	case *blademaster.Context:
		i, e := c.Get(_ctxTag)
		if e {
			ct := i.(*CostTree)
			ct.CheckPoint = time.Now()  // this line may cause data race
			return ct
		}
	}
	return NewCostTree()
}

Feb 02 '23 09:02 bingzhuo2008