go
go copied to clipboard
sync: inconsistent mutex state
My code face "sync: inconsistent mutex state" panic, and I'm trying to fix it.
https://github.com/golang/go/blob/d54a9a9c42e751a020308cae296add426b56d0f0/src/sync/mutex.go#L147
I wonder should this line be old = atomic.LoadInt32(&m.state)?
https://github.com/golang/go/blob/d54a9a9c42e751a020308cae296add426b56d0f0/src/sync/mutex.go#L140
It seems like there will be concurrent write in line: 212, and this line cannot guarantee old value is correct.
I found https://groups.google.com/g/golang-nuts/c/Eq4CWTB6LhM/m/F0Lbgnfq4hQJ this post said it's a bug, but 7 years past, it's still there, any purpose? Or is it safe to read int value from a pointer's field?
sorry for my poor english
This looks like memory corruption. Have you tried running your program under the race detector? See https://blog.golang.org/race-detector .
This looks like memory corruption. Have you tried running your program under the race detector? See https://blog.golang.org/race-detector .
unfortunately my code running in produce environment that I have no fully control, and It cannot be reappeared in dev environment. I wrote a little test case but no race found.
Can you please provide the full panic message. Thank you
fatal error: sync: inconsistent mutex state
goroutine 3571787 [running]:
runtime.throw(0x4e6ccbc, 0x1e)
/usr/local/go/src/runtime/panic.go:1116 +0x72 fp=0xc0021dd378 sp=0xc0021dd348 pc=0x439512
sync.throw(0x4e6ccbc, 0x1e)
/usr/local/go/src/runtime/panic.go:1102 +0x35 fp=0xc0021dd398 sp=0xc0021dd378 pc=0x46cbb5
sync.(*Mutex).lockSlow(0x8313a38)
/usr/local/go/src/sync/mutex.go:147 +0x165 fp=0xc0021dd3e0 sp=0xc0021dd398 pc=0x47d2a5
sync.(*Mutex).Lock(...)
/usr/local/go/src/sync/mutex.go:81
sync.(*Pool).pinSlow(0xc00007ec68, 0x0, 0x0)
/usr/local/go/src/sync/pool.go:213 +0x271 fp=0xc0021dd460 sp=0xc0021dd3e0 pc=0x47de31
sync.(*Pool).pin(0xc00007ec68, 0xc00bb1fab2, 0x4)
/usr/local/go/src/sync/pool.go:206 +0x65 fp=0xc0021dd488 sp=0xc0021dd460 pc=0x47db85
sync.(*Pool).Get(0xc00007ec68, 0x113, 0x1)
/usr/local/go/src/sync/pool.go:128 +0x2f fp=0xc0021dd4d0 sp=0xc0021dd488 pc=0x47d82f
then will be business logic, take something from pool.
my go version is 1.14
Timed out in state WaitingForInfo. Closing.
(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)
what's the final solution? I catch this bug again. go version go1.16.5 linux/amd64
fatal error: sync: inconsistent mutex state
goroutine 3295163 [running]:
runtime.throw(0xa5268d, 0x1e)
/home/ubuntu/go/src/runtime/panic.go:1117 +0x72 fp=0xc000623ec8 sp=0xc000623e98 pc=0x436292
sync.throw(0xa5268d, 0x1e)
/home/ubuntu/go/src/runtime/panic.go:1103 +0x35 fp=0xc000623ee8 sp=0xc000623ec8 pc=0x467575
sync.(*Mutex).lockSlow(0x1582030)
/home/ubuntu/go/src/sync/mutex.go:125 +0x1fb fp=0xc000623f38 sp=0xc000623ee8 pc=0x483f5b
sync.(*Mutex).Lock(...)
/home/ubuntu/go/src/sync/mutex.go:81
sync.(*RWMutex).Lock(0x1582030)
/home/ubuntu/go/src/sync/rwmutex.go:111 +0x90 fp=0xc000623f60 sp=0xc000623f38 pc=0x485570
github.com/hftquanter/hftquant/dump.Update(0x405ea91d9571ecac, 0x0, 0x3fb70a3d70a3d70a, 0xc014000000000000, 0x4032a872b020c49c, 0x0, 0x40329d4fdf3b645a, 0x0)
/home/ubuntu/gopath/src/github.com/hftquanter/hftquant/dump/dump.go:27 +0x3e fp=0xc000623fa0 sp=0xc000623f60 pc=0x7e385e
runtime.goexit()
/home/ubuntu/go/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc000623fa8 sp=0xc000623fa0 pc=0x46bfa1
@beaquant can you provide a complete program that demonstrates the problem?
cc @ianlancetaylor
It's very unlikely that we will be able to fix this issue if we can't reproduce it.
I strongly encourage you to find a way to run the program under the race detector, as that is the most likely cause of the problem. Nobody else is reporting this problem, so the odds are against it being a bug in the standard library. In any case, if it is a bug in the standard library, we will need a way to reproduce the problem so that we can analyze it.
The lines old = m.state in sync/mutex.go are fine. The old value is only used as an input to atomic.CompareAndSwapInt32.
gopherbot, please don't do that.
for " The lines old = m.state in sync/mutex.go are fine. The old value is only used as an input to atomic.CompareAndSwapInt32. "
It's very unlikely that we will be able to fix this issue if we can't reproduce it.
I strongly encourage you to find a way to run the program under the race detector, as that is the most likely cause of the problem. Nobody else is reporting this problem, so the odds are against it being a bug in the standard library. In any case, if it is a bug in the standard library, we will need a way to reproduce the problem so that we can analyze it.
The lines
old = m.statein sync/mutex.go are fine. Theoldvalue is only used as an input toatomic.CompareAndSwapInt32.
this said The old value is only used as an input to atomic.CompareAndSwapInt32.
it may ignore line 97
"
if old&(mutexLocked|mutexStarving) == mutexLocked && runtime_canSpin(iter) {
// Active spinning makes sense.
// Try to set mutexWoken flag to inform Unlock
// to not wake other blocked goroutines.
if !awoke && old&mutexWoken == 0 && old>>mutexWaiterShift != 0 &&
atomic.CompareAndSwapInt32(&m.state, old, old|mutexWoken) {
awoke = true
}
runtime_doSpin()
iter++
old = m.state
continue
}
"
old>>mutexWaiterShift != 0
here also use old value.
and the old of state may also change when awakened.
and the number of waiter will also change.
I think direct old = m.state is not safe in a concurrent scenario,why not old = atomic.LoadInt32(&m.state)?
This makes me quite confused.
The expression old>>mutexWaiterShift != 0 may use an older value of old. That is true. But it doesn't matter because the same condition also calls atomic.CompareAndSwapInt32(&m.state, old, old|mutexWoken). If the value of old is out of date, the call to atomic.CompareAndSwapInt32 will fail and the code will continue executing.
and the old of state may also change when awakened. and the number of waiter will also change.
I'm sorry, I don't understand how this could happen. The lockSlow method only sleeps at one point, at the call to runtime_SemacquireMutex. It immediately reloads old after it wakes up. And while it may possibly pick up an out-of-date value of m.state, it still doesn't matter.
I want to repeat that we are not seeing any bug reports about mutexes other than this one. Perhaps there is something wrong with this code, but it is not showing up as problems that people are reporting. If there is a real problem here, it should be possible to write a program that reproduces the problem.
The expression
old>>mutexWaiterShift != 0may use an older value ofold. That is true. But it doesn't matter because the same condition also callsatomic.CompareAndSwapInt32(&m.state, old, old|mutexWoken). If the value ofoldis out of date, the call toatomic.CompareAndSwapInt32will fail and the code will continue executing.and the old of state may also change when awakened. and the number of waiter will also change.
I'm sorry, I don't understand how this could happen. The
lockSlowmethod only sleeps at one point, at the call toruntime_SemacquireMutex. It immediately reloadsoldafter it wakes up. And while it may possibly pick up an out-of-date value ofm.state, it still doesn't matter.I want to repeat that we are not seeing any bug reports about mutexes other than this one. Perhaps there is something wrong with this code, but it is not showing up as problems that people are reporting. If there is a real problem here, it should be possible to write a program that reproduces the problem.
After all, using a variable is for reading and writing. “ If there is a real problem here, it should be possible to write a program that reproduces the problem. ” The point of attention in the sentence you said above is completely written in this respect. I think the starting point here is correct.
But you have a lot of reading judgments that depend on this variable, for example:
// Don't try to acquire starving mutex, new arriving goroutines must queue.
if old&mutexStarving == 0 {
new |= mutexLocked
}
if old&(mutexLocked|mutexStarving) != 0 {
new += 1 << mutexWaiterShift
}
// The current goroutine switches mutex to starvation mode.
// But if the mutex is currently unlocked, don't do the switch.
// Unlock expects that starving mutex has waiters, which will not
// be true in this case.
if starving && old&mutexLocked != 0 {
new |= mutexStarving
}
If there are even more cases, this is an expired value,is this judgment meaningful?(Does this fit the contextual business semantics?) And, will a goroutine(suppose it can no longer spin) always fail to meet the conditions of line 129 ‘atomic.CompareAndSwapInt32(&m.state, old, new)’?
Will it be in a for loop instead of waiting in runtime_SemacquireMutex? so I still think it is dangerous to read without synchronization 'old = m.state'.
I think the reason why bugs do not often appear is because of the previous spin, and the other is that the goroutine is scheduled once every 10ms.
Thank you!
Changing the variable new does nothing by itself. The only way that new affects anything is if the atomic.CompareAndSwapInt32 succeeds. And that will only succeed if m.state still holds the value old, the value that caused the changes to new to occur.
I don't think we're getting to the same point. If the atomic.CompareAndSwapInt32 is successful, the state will become the new value. This means that the new value also becomes the old value. We need to loop and think again, and go back to what I said above.
Thanks.
hello, has this been resolved?
I also had this issue by using sync.pool
I also had this issue by using sync.pool
I don't think that will help any. Note comment above:
I want to repeat that we are not seeing any bug reports about mutexes other than this one. Perhaps there is something wrong with this code, but it is not showing up as problems that people are reporting. If there is a real problem here, it should be possible to write a program that reproduces the problem.
I changed 'throw("sync: inconsistent mutex state")' to 'throw("sync: inconsistent mutex state, new="+strconv.FormatInt(new,10))' and it printed a huge number such as 2064149489, does that mean there is hundred millions of goroutines using the mutex?
Either that, or there is some corruption happening that is overwriting the mutex. I would bet on the latter.
I don't think we're getting to the same point. If the atomic.CompareAndSwapInt32 is successful, the state will become the new value. This means that the new value also becomes the old value. We need to loop and think again, and go back to what I said above.
Thanks.
have you solve this problem?
I solved by removing some concurrent function call which may cause data race. But I don't know how this affected sync.mutex?
func GetCostTreeFromCtx(ctx context.Context) *CostTree {
switch c := ctx.(type) {
case *blademaster.Context:
i, e := c.Get(_ctxTag)
if e {
ct := i.(*CostTree)
ct.CheckPoint = time.Now() // this line may cause data race
return ct
}
}
return NewCostTree()
}