mgmt icon indicating copy to clipboard operation
mgmt copied to clipboard

[bug] I got "panic: close of closed channel" running "mgmt run"

Open tpetrosy opened this issue 5 years ago • 6 comments

Versions:

  • mgmt version (eg: mgmt --version): root@debian:~/mgmt# mgmt --version mgmt version 0.0.21

root@debian:~/mgmt# dpkg -l |grep mgmt ii mgmt 0.0.21

  • operating system/distribution (eg: uname -a): root@debian:~/mgmt# uname -a Linux debian 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u1 (2019-09-20) x86_64 GNU/Linux

  • golang version (eg: go version): root@debian:~/mgmt# go version go version go1.11.6 linux/amd64

Description:root@debian:~/mgmt# mgmt run --tmp-prefix --no-pgp lang examples/lang/datetime3.mcl

2019-11-05 15:51:06.162382 I | cli: lang: lexing/parsing... 2019-11-05 15:51:09.520982 I | cli: lang: init... 2019-11-05 15:51:09.521205 I | cli: lang: interpolating... 2019-11-05 15:51:09.521328 I | cli: lang: building scope... 2019-11-05 15:51:10.632517 I | cli: lang: ast: init... 2019-11-05 15:51:10.632652 I | cli: lang: ast: interpolating... 2019-11-05 15:51:10.632718 I | cli: lang: ast: building scope... 2019-11-05 15:51:12.713864 I | cli: lang: ast: init... 2019-11-05 15:51:12.714013 I | cli: lang: ast: interpolating... 2019-11-05 15:51:12.714093 I | cli: lang: ast: building scope... 2019-11-05 15:51:12.714784 I | cli: lang: running type unification... 2019-11-05 15:51:12.715421 I | cli: lang: input: examples/lang/datetime3.mcl 2019-11-05 15:51:12.715518 I | cli: lang: tree: . ├── datetime3.mcl └── metadata.yaml This is: mgmt, version: 0.0.21 Copyright (C) 2013-2019+ James Shubin and the project contributors Written by James Shubin [email protected] and the project contributors 15:51:12 main: start: 1572987072715739864 15:51:12 main: warning: working prefix directory is temporary! 15:51:12 main: working prefix is: /tmp/mgmt-debian-471062176 15:51:12 main: no seeds specified! 15:51:12 etcd: running... 15:51:12 etcd: watching chooser... 15:51:12 etcd: bootstrapping... 15:51:12 etcd: nominated: debian=http://localhost:2380 15:51:12 etcd: waiting for server... 15:51:12 etcd: server: runServer: (newCluster=true): debian=http://localhost:2380 15:51:12 etcd: server: starting... 15:51:12 etcd: server: ready 15:51:12 etcd: connect... 15:51:12 etcd: connected! 15:51:12 etcd: chooser: set dynamic cluster size to: 5 15:51:12 etcd: watching nominees... 15:51:12 etcd: nominated: debian=http://localhost:2380 15:51:12 etcd: watching volunteers... 15:51:12 etcd: volunteering... 15:51:12 etcd: volunteers: debian=http://localhost:2380 15:51:12 etcd: chooser: (debian=http://localhost:2380)/(debian=http://localhost:2380) 15:51:12 etcd: chooser result(+/-): []/[] 15:51:12 etcd: watching endpoints... 15:51:12 main: etcd is ready! 15:51:12 main: waiting... 15:51:12 main: running... 15:51:12 main: waiting... 15:51:12 gapi: generating new graph... 15:51:12 gapi: swap! 15:51:12 gapi: lang: lexing/parsing... 15:51:12 etcd: volunteers: debian=http://localhost:2380 15:51:12 etcd: chooser: (debian=http://localhost:2380)/(debian=http://localhost:2380) 15:51:12 etcd: chooser result(+/-): []/[] 15:51:16 gapi: lang: init... 15:51:16 gapi: lang: interpolating... 15:51:16 gapi: lang: building scope... 15:51:17 gapi: lang: ast: init... 15:51:17 gapi: lang: ast: interpolating... 15:51:17 gapi: lang: ast: building scope... 15:51:19 gapi: lang: ast: init... 15:51:19 gapi: lang: ast: interpolating... 15:51:19 gapi: lang: ast: building scope... 15:51:19 gapi: lang: running type unification... 15:51:19 gapi: lang: building function graph... 15:51:19 gapi: lang: function engine initializing... 15:51:19 gapi: lang: function engine validating... 15:51:19 gapi: lang: function engine starting... 15:51:19 gapi: lang: stream... 15:51:19 gapi: lang: loop... 15:51:19 gapi: lang: funcs: func str("Now + 1 year is: {{ .year }} seconds, aka: {{ datetime_print .year }}\n\nload average: {{ .load }}\n\nvu: {{ .vumeter }}\n") started 15:51:19 gapi: lang: funcs: func str("Now + 1 year is: {{ .year }} seconds, aka: {{ datetime_print .year }}\n\nload average: {{ .load }}\n\nvu: {{ .vumeter }}\n") changed 15:51:19 gapi: lang: funcs: func int(24) started 15:51:19 gapi: lang: funcs: func int(24) changed 15:51:19 gapi: lang: funcs: func str("*") started 15:51:19 gapi: lang: funcs: func str("*") changed 15:51:19 gapi: lang: funcs: func str("exists") started 15:51:19 gapi: lang: funcs: func str("exists") changed 15:51:19 gapi: lang: funcs: func str("/tmp/mgmt/datetime") started 15:51:19 gapi: lang: funcs: func str("/tmp/mgmt/datetime") changed 15:51:19 gapi: lang: funcs: func str("+") started 15:51:19 gapi: lang: funcs: func str("+") changed 15:51:19 gapi: lang: funcs: func str("*") started 15:51:19 gapi: lang: funcs: func str("*") changed 15:51:19 gapi: lang: funcs: func int(60) started 15:51:19 gapi: lang: funcs: func int(60) changed 15:51:19 gapi: lang: funcs: func str("====") started 15:51:19 gapi: lang: funcs: func str("====") changed 15:51:19 gapi: lang: funcs: func int(365) started 15:51:19 gapi: lang: funcs: func int(365) changed 15:51:19 gapi: lang: funcs: func call:datetime.now() started 15:51:19 gapi: lang: funcs: func call:datetime.now() changed 15:51:19 gapi: lang: funcs: func int(10) started 15:51:19 gapi: lang: funcs: func int(10) changed 15:51:19 gapi: lang: funcs: func str("x1") started 15:51:19 gapi: lang: funcs: func str("x1") changed 15:51:19 gapi: lang: funcs: func float(0.9) started 15:51:19 gapi: lang: funcs: func float(0.9) changed 15:51:19 gapi: lang: funcs: func str("====") stopped 15:51:19 gapi: lang: funcs: func int(10) stopped 15:51:19 gapi: lang: funcs: func call:example.vumeter(str("===="), int(10), float(0.9)) started 15:51:19 gapi: lang: funcs: func call:example.vumeter(str("===="), int(10), float(0.9)) changed 15:51:19 gapi: lang: funcs: func var(vumeter) started 15:51:19 gapi: lang: funcs: func var(vumeter) changed 15:51:19 gapi: lang: funcs: func int(60) started 15:51:19 gapi: lang: funcs: func int(60) changed 15:51:19 gapi: lang: funcs: func str("*") stopped 15:51:19 gapi: lang: funcs: func int(60) stopped 15:51:19 gapi: lang: funcs: func call:_operator(str("*"), int(60), int(60)) started 15:51:19 gapi: lang: funcs: func call:_operator(str("*"), int(60), int(60)) changed 15:51:19 gapi: lang: funcs: func str("*") started 15:51:19 gapi: lang: funcs: func str("*") changed 15:51:19 gapi: lang: funcs: func int(24) stopped 15:51:19 gapi: lang: funcs: func call:_operator(str("*"), call:_operator(str("*"), int(60), int(60)), int(24)) started 15:51:19 gapi: lang: funcs: func call:_operator(str("*"), call:_operator(str("*"), int(60), int(60)), int(24)) changed 15:51:19 gapi: lang: funcs: func str("*") stopped 15:51:19 gapi: lang: funcs: func int(365) stopped 15:51:19 gapi: lang: funcs: func call:_operator(str("*"), call:_operator(str("*"), call:_operator(str("*"), int(60), int(60)), int(24)), int(365)) started 15:51:19 gapi: lang: funcs: func call:_operator(str("*"), call:_operator(str("*"), call:_operator(str("*"), int(60), int(60)), int(24)), int(365)) changed 15:51:19 gapi: lang: funcs: func var(ayear) started 15:51:19 gapi: lang: funcs: func var(ayear) changed 15:51:19 gapi: lang: funcs: func str("+") stopped 15:51:19 gapi: lang: funcs: func call:_operator(str("+"), call:datetime.now(), var(ayear)) started 15:51:19 gapi: lang: funcs: func call:_operator(str("+"), call:datetime.now(), var(ayear)) changed 15:51:19 gapi: lang: funcs: func var(secplusone) started 15:51:19 gapi: lang: funcs: func var(secplusone) changed 15:51:19 gapi: lang: funcs: func call:sys.load() started 15:51:19 gapi: lang: funcs: func call:sys.load() changed 15:51:19 gapi: lang: funcs: func str("x1") stopped 15:51:19 gapi: lang: funcs: func call:structlookup(call:sys.load(), str("x1")) started 15:51:19 gapi: lang: funcs: func call:structlookup(call:sys.load(), str("x1")) changed 15:51:19 gapi: lang: funcs: func var(theload) started 15:51:19 gapi: lang: funcs: func var(theload) changed 15:51:19 gapi: lang: funcs: func struct(year: var(secplusone); load: var(theload); vumeter: var(vumeter)) started 15:51:19 gapi: lang: funcs: func struct(year: var(secplusone); load: var(theload); vumeter: var(vumeter)) changed 15:51:19 gapi: lang: funcs: func var(tmplvalues) started 15:51:19 gapi: lang: funcs: func var(tmplvalues) changed 15:51:19 gapi: lang: funcs: func str("Now + 1 year is: {{ .year }} seconds, aka: {{ datetime_print .year }}\n\nload average: {{ .load }}\n\nvu: {{ .vumeter }}\n") stopped 15:51:19 gapi: lang: funcs: func call:template(str("Now + 1 year is: {{ .year }} seconds, aka: {{ datetime_print .year }}\n\nload average: {{ .load }}\n\nvu: {{ .vumeter }}\n"), var(tmplvalues)) started 15:51:19 gapi: lang: funcs: func call:template(str("Now + 1 year is: {{ .year }} seconds, aka: {{ datetime_print .year }}\n\nload average: {{ .load }}\n\nvu: {{ .vumeter }}\n"), var(tmplvalues)) changed 15:51:19 gapi: lang: funcs: func str("/tmp/mgmt/datetime") stopped 15:51:19 gapi: lang: funcs: func int(60) stopped 15:51:19 gapi: lang: funcs: func call:_operator(str("*"), int(60), int(60)) stopped 15:51:19 gapi: lang: funcs: func str("*") stopped 15:51:19 gapi: lang: funcs: func call:_operator(str("*"), call:_operator(str("*"), int(60), int(60)), int(24)) stopped 15:51:19 gapi: lang: funcs: func call:_operator(str("*"), call:_operator(str("*"), call:_operator(str("*"), int(60), int(60)), int(24)), int(365)) stopped 15:51:19 gapi: lang: funcs: func var(ayear) stopped 15:51:19 gapi: lang: funcs: func str("exists") stopped 15:51:19 gapi: generating new graph... 15:51:19 gapi: generating new graph... 15:51:19 gapi: lang: running interpret... 15:51:19 engine: autoedge: adding autoedges... 15:51:19 engine: autogroup: algorithm: wrappedGrouper: NonReachabilityGrouper... 15:51:19 main: commit... 15:51:19 engine: graph sync... 15:51:19 main: graph: Vertices(1), Edges(0) 15:51:19 main: waiting... 15:51:19 gapi: lang: running interpret... 15:51:19 engine: autoedge: adding autoedges... 15:51:19 engine: autogroup: algorithm: wrappedGrouper: NonReachabilityGrouper... 15:51:19 gapi: generating new graph... 15:51:19 engine: Worker(file[/tmp/mgmt/datetime]) 15:51:19 engine: Watch(file[/tmp/mgmt/datetime]) 15:51:19 engine: file[/tmp/mgmt/datetime]: CheckApply(true) 15:51:19 engine: file[/tmp/mgmt/datetime]: resource: contentCheckApply(true) 15:51:19 engine: file[/tmp/mgmt/datetime]: CheckApply(true): Return(false, open /tmp/mgmt/datetime: no such file or directory) 15:51:19 main: commit... 15:51:19 engine: graph sync... 15:51:19 main: graph: Vertices(1), Edges(0) 15:51:19 main: waiting... 15:51:19 gapi: lang: running interpret... 15:51:19 engine: autoedge: adding autoedges... 15:51:19 engine: autogroup: algorithm: wrappedGrouper: NonReachabilityGrouper... 15:51:19 main: loop: exited panic: close of closed channel

goroutine 239 [running]: github.com/purpleidea/mgmt/engine/graph.(*State).Pause(0xc00087a200, 0xc0003644e0, 0xc000013d18) /root/go/src/github.com/purpleidea/mgmt/engine/graph/state.go:294 +0xef github.com/purpleidea/mgmt/engine/graph.(*Engine).Pause(0xc000304460, 0x0, 0x0, 0x0) /root/go/src/github.com/purpleidea/mgmt/engine/graph/engine.go:392 +0xcf github.com/purpleidea/mgmt/lib.(*Main).Run.func15(0x1b11b18, 0xc0000d44e0, 0xc00000e8d8, 0xc000797200, 0xc00000e8f0, 0xc00012afc0, 0xc0000d44b0, 0x6, 0xc0007efc80, 0xc0002ebf60, ...) /root/go/src/github.com/purpleidea/mgmt/lib/main.go:734 +0x11d0 created by github.com/purpleidea/mgmt/lib.(*Main).Run /root/go/src/github.com/purpleidea/mgmt/lib/main.go:522 +0x13fb root@debian:~/mgmt#

tpetrosy avatar Nov 05 '19 20:11 tpetrosy

Thanks for the report! As discussed in the workshop, this is the bug I discovered the other day. I'll be on it right after DOD Montreal finishes tomorrow. For some reason, it didn't show up until golang 1.12.x+ so I'm guessing something changed in the scheduler and we missed a race.

Thanks for the report!

purpleidea avatar Nov 06 '19 02:11 purpleidea

/cc @frebib

purpleidea avatar Nov 26 '20 18:11 purpleidea

Some context:

Most of the relevant code is probably in here.

https://github.com/purpleidea/mgmt/tree/master/engine/graph

More specifically:

https://github.com/purpleidea/mgmt/blob/master/engine/graph/actions.go and: https://github.com/purpleidea/mgmt/blob/master/engine/graph/state.go

but maybe also: https://github.com/purpleidea/mgmt/blob/master/engine/graph/engine.go

I wouldn't be surprised if the issue is related to this Pause bit: https://github.com/purpleidea/mgmt/blob/4a2864701c633501b731ba98b7f35b267c37bad6/engine/graph/state.go#L337

purpleidea avatar Nov 26 '20 19:11 purpleidea

In an attempt to reproduce this, I changed the datetime.now() function tick every nanosecond, just to make the thing really spin fast. This patch does that

diff --git a/lang/funcs/core/datetime/now_fact.go b/lang/funcs/core/datetime/now_fact.go
index ade6b47..082e7f6 100644
--- a/lang/funcs/core/datetime/now_fact.go
+++ b/lang/funcs/core/datetime/now_fact.go
@@ -63,7 +63,7 @@ func (obj *DateTimeFact) Stream() error {
        // we set this up so it doesn't tick as often? -- Yes (make this a function or create a limit function to wrap this)
        // 3) is it best to have a delta timer that wakes up before it's needed
        // and calculates how much longer to sleep for?
-       ticker := time.NewTicker(time.Duration(1) * time.Second)
+       ticker := time.NewTicker(time.Nanosecond)

        // streams must generate an initial event on startup
        startChan := make(chan struct{}) // start signal
@@ -81,7 +81,7 @@ func (obj *DateTimeFact) Stream() error {

                select {
                case obj.init.Output <- &types.IntValue{ // seconds since 1970...
-                       V: time.Now().Unix(), // .UTC() not necessary
+                       V: time.Now().UnixNano(), // .UTC() not necessary
                }:
                case <-obj.closeChan:
                        return nil

It also looks like the vumeter code is broken, so this prevents that from throwing a segfault

diff --git a/lang/funcs/core/example/vumeter_func.go b/lang/funcs/core/example/vumeter_func.go
index 4fafa86..33f6dd4 100644
--- a/lang/funcs/core/example/vumeter_func.go
+++ b/lang/funcs/core/example/vumeter_func.go
@@ -110,7 +110,10 @@ func (obj *VUMeterFunc) Stream() error {
                        obj.multiplier = input.Struct()["multiplier"].Int()
                        obj.peak = input.Struct()["peak"].Float()

-               //case <-ticker.C: // received the timer event
+                       // select again to follow goChan to set obj.Result
+                       // otherwise we try do nil-deref the result and segfault
+                       continue
+
                case <-goChan:

                        if obj.last == nil {

Finally, I also added these in the hopes that these uncaught errors would expose any pause failures from within. I never saw these catch, but if they do it would indicate that there is possibly an issue where on one cycle the graph is left partially paused, so the already paused resources remain paused and cause the double close() on the next iteration of the engine:

diff --git a/lib/main.go b/lib/main.go
index f01131f..1ea3b6a 100644
--- a/lib/main.go
+++ b/lib/main.go
@@ -730,8 +730,13 @@ func (obj *Main) Run() error {
                        // we need the vertices to be paused to work on them, so
                        // run graph vertex LOCK...
                        if started { // TODO: we can flatten this check out I think
-                               converger.Pause()       // FIXME: add sync wait?
-                               obj.ge.Pause(fastPause) // sync
+                               // these should never fail, so if they do something is wrong
+                               if err := converger.Pause(); err != nil { // FIXME: add sync wait?
+                                       panic(fmt.Sprintf("failed to pause converger: %s", err))
+                               }
+                               if err := obj.ge.Pause(fastPause); err != nil { // sync
+                                       panic(fmt.Sprintf("failed to pause graph engine: %s", err))
+                               }
                                started = false
                        }

I couldn't reproduce this crash, but I don't think that means it's gone. It could possibly take a slower or a faster computer to actually produce it again

frebib avatar Nov 26 '20 21:11 frebib

So I finally worked out why this happens and what causes it. It's not super obvious and there is a large amount of indirection through the layers of the engine in mgmt; I'll do my best to explain it.

The simple answer is that the engine tries to swap graphs too frequently. Note from the original report of this issue there are several instances of the log lines generating new graph:

15:51:12 gapi: generating new graph...
15:51:19 gapi: generating new graph...
15:51:19 gapi: generating new graph...
15:51:19 main: commit...
15:51:19 engine: graph sync...
15:51:19 gapi: generating new graph...
15:51:19 main: commit...
15:51:19 engine: graph sync...

That log entry originates here and is printed every time the streamChan receives a message from the graph to indicate that something has changed https://github.com/purpleidea/mgmt/blob/0c2895701626b5d55dc645e475378bf8a6882e19/lang/gapi.go#L535 (I'm not entirely sure why this exists or what it's for, though. I think possibly as explained here) https://github.com/purpleidea/mgmt/blob/0c2895701626b5d55dc645e475378bf8a6882e19/lang/gapi.go#L580-L581

The part that's confusing to me is that the pause/swap/resume is in a synchronous loop and no external factors should be able to affect it. I still think the pause/resume implementation is likely to blame. It doesn't appear to lock the pause/resume enough and I think the race could be between the pause call, and the paused = true line. Ideally it should be an atomic operation locked with a mutex

frebib avatar May 04 '21 17:05 frebib

Hi folks! Sorry for not replying here in a while. Here's some backstory!

My golang concurrency skills were probably not as good back then when I wrote this code as they are now (I hope). So I don't think I had enough brain to fix this properly. Also I wasn't super motivated because I was blocked on not having finished the compiler.

What with that getting close, I took a look at the resource engine again. Long story short, I can't seem to make it panic anymore. This is good that it's not panic-ing, but bad because I'm worried there's a sneaky panic still hidden somewhere.

I looked through all the git logs that touched the engine since 0.0.21 as reported, and a few things were definitely fixed, but nothing that I think manifested like this panic did.

Secondly, I took some time and read through most of the engine code. The good news is that I understand most of it well again, and I'm actually pretty happy with it. I was worried a rewrite would have been necessary. I did identify a few "yucky" areas, and I've since pushed some patches (now in git master) to fix those.

There are still some areas I'd like to improve on, but nothing major. The only real thing I need to look deeper into is the error paths. For example, it's plausible that this sort of panic only occurs during a rare error scenario of some kind. I might need to make a resource which errors at a random time to find these, for example. Most error scenarios should be recoverable in some way (a retry might happen, or a graph switch might happen as a result, or so on) and so it shouldn't mean a full shutdown or a panic a lot of the time, so we care about these codepaths, they just aren't perhaps tested enough yet.

Long story short: if you have any clever stress tests you'd like to try out, please give this a whirl on git master. And if anything crashes, please save the full trace and the exact git sha1 and mcl code that caused it. If you added printf's or anything, then commit all of that and let me know, so that I can read the precise trace correctly and debug this.

Thanks!

purpleidea avatar Aug 31 '23 06:08 purpleidea