
Wag service stopping - Node entered state: dead

Mariaab1 opened this issue 10 months ago • 7 comments

Hello again!

I'm running version v8.0.1, and a few days ago the service stopped.

When we checked the logs, they said the node had died. However, we also noticed that the disk was at 100% usage, so we assumed it was a resource shortage on our end. We cleaned up, restarted the service, and it was working fine.

But today the service stopped again, and this time we do have enough disk space 🤔

This is today's log output:

```
Feb 27 13:56:37 vpn-clonada wag[133]: {"level":"info","ts":"2025-02-27T13:56:37.955945+0100","caller":"traceutil/trace.go:171","msg":"trace[1655639491] put","detail":"{key:wag/node/229e0fb671e145bf/ping; req_size:65; response_revision:345045; }","duration":"102.626463ms","start":"2025-02-27T13:56:37.853298+0100","end":"2025-02-27T13:56:37.955924+0100","steps":["trace[1655639491] 'process raft request'  (duration: 102.544014ms)"],"step_count":1}
Feb 27 13:56:42 vpn-clonada wag[133]: {"level":"info","ts":"2025-02-27T13:56:42.982961+0100","caller":"traceutil/trace.go:171","msg":"trace[2013250737] put","detail":"{key:wag/node/229e0fb671e145bf/ping; req_size:65; response_revision:345046; }","duration":"130.014788ms","start":"2025-02-27T13:56:42.852929+0100","end":"2025-02-27T13:56:42.982944+0100","steps":["trace[2013250737] 'process raft request'  (duration: 129.943911ms)"],"step_count":1}
Feb 27 13:56:53 vpn-clonada wag[133]: {"level":"warn","ts":"2025-02-27T13:56:53.818435+0100","caller":"etcdserver/v3_server.go:920","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":5025899804084671520,"retry-timeout":"500ms"}
Feb 27 13:56:54 vpn-clonada wag[133]: {"level":"warn","ts":"2025-02-27T13:56:54.318596+0100","caller":"etcdserver/v3_server.go:920","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":5025899804084671520,"retry-timeout":"500ms"}
Feb 27 13:56:54 vpn-clonada wag[133]: {"level":"warn","ts":"2025-02-27T13:56:54.818975+0100","caller":"etcdserver/v3_server.go:920","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":5025899804084671520,"retry-timeout":"500ms"}
Feb 27 13:56:55 vpn-clonada wag[133]: {"level":"warn","ts":"2025-02-27T13:56:55.104476+0100","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2025-02-27T13:56:52.852773+0100","time spent":"2.251662469s","remote":"@","response type":"/etcdserverpb.KV/Put","request count":1,"request size":65,"response count":0,"response size":0,"request content":"key:\"wag/node/229e0fb671e145bf/ping\" value_size:31 "}
Feb 27 13:56:55 vpn-clonada wag[133]: {"level":"warn","ts":"2025-02-27T13:56:55.325400+0100","caller":"etcdserver/v3_server.go:920","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":5025899804084671520,"retry-timeout":"500ms"}
Feb 27 13:56:55 vpn-clonada wag[133]: {"level":"warn","ts":"2025-02-27T13:56:55.104587+0100","caller":"wal/wal.go:805","msg":"slow fdatasync","took":"2.251717265s","expected-duration":"1s"}
Feb 27 13:56:55 vpn-clonada wag[133]: {"level":"info","ts":"2025-02-27T13:56:55.801016+0100","caller":"traceutil/trace.go:171","msg":"trace[1349514102] linearizableReadLoop","detail":"{readStateIndex:345837; appliedIndex:345836; }","duration":"2.482634617s","start":"2025-02-27T13:56:53.318351+0100","end":"2025-02-27T13:56:55.800985+0100","steps":["trace[1349514102] 'read index received'  (duration: 2.482537879s)","trace[1349514102] 'applied index is now lower than readState.Index'  (duration: 95.55µs)"],"step_count":2}
Feb 27 13:56:55 vpn-clonada wag[133]: {"level":"warn","ts":"2025-02-27T13:56:55.801142+0100","caller":"etcdserver/util.go:170","msg":"apply request took too long","took":"2.48276803s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"devices-\" range_end:\"devices.\" sort_order:DESCEND ","response":"range_response_count:6 size:2878"}
Feb 27 13:56:55 vpn-clonada wag[133]: {"level":"info","ts":"2025-02-27T13:56:55.801169+0100","caller":"traceutil/trace.go:171","msg":"trace[209911491] range","detail":"{range_begin:devices-; range_end:devices.; response_count:6; response_revision:345048; }","duration":"2.482844073s","start":"2025-02-27T13:56:53.318318+0100","end":"2025-02-27T13:56:55.801162+0100","steps":["trace[209911491] 'agreement among raft nodes before linearized reading'  (duration: 2.482729106s)"],"step_count":1}
Feb 27 13:56:55 vpn-clonada wag[133]: {"level":"warn","ts":"2025-02-27T13:56:55.801194+0100","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2025-02-27T13:56:53.318302+0100","time spent":"2.482885162s","remote":"@","response type":"/etcdserverpb.KV/Range","request count":0,"request size":22,"response count":6,"response size":2899,"request content":"key:\"devices-\" range_end:\"devices.\" sort_order:DESCEND "}
Feb 27 13:56:55 vpn-clonada wag[133]: {"level":"warn","ts":"2025-02-27T13:56:55.045798+0100","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004763c0//tmp/wag.etcd.c381c917564213849e1a","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Feb 27 13:56:56 vpn-clonada wag[133]: 2025/02/27 13:56:56 unable to write liveness value
Feb 27 13:56:56 vpn-clonada wag[133]: 2025/02/27 13:56:56 node entered state:  dead
Feb 27 13:56:56 vpn-clonada wag[133]: 2025/02/27 13:56:56 Tearing down node
Feb 27 13:56:56 vpn-clonada wag[133]: 2025/02/27 13:56:56 Removing wireguard device
Feb 27 13:56:56 vpn-clonada wag[133]: 2025/02/27 13:56:56 Wireguard device removed
Feb 27 13:56:56 vpn-clonada wag[133]: 2025/02/27 13:56:56 Removing Firewall rules...
Feb 27 13:56:56 vpn-clonada wag[133]: 2025/02/27 13:56:56 Removing Firewall rules...
Feb 27 13:56:56 vpn-clonada wag[133]: 2025/02/27 13:56:56 Firewall rules removed.
Feb 27 13:56:56 vpn-clonada wag[133]: 2025/02/27 13:56:56 Stopped Management UI
Feb 27 13:56:56 vpn-clonada wag[133]: 2025/02/27 13:56:56 Removing wireguard device
Feb 27 13:56:56 vpn-clonada wag[133]: 2025/02/27 13:56:56 Unable to remove wireguard device, delete failed:  failed to execute message: netlink receive: no such device
Feb 27 13:56:56 vpn-clonada wag[133]: 2025/02/27 13:56:56 remove /tmp/wag.sock: no such file or directory
Feb 27 13:56:56 vpn-clonada wag[133]: 2025/02/27 13:56:56 Stopped Management UI
Feb 27 13:56:56 vpn-clonada wag[133]: 2025/02/27 13:56:56 Creating redirection from 80/tcp to TLS webserver public listener failed: http: Server closed
Feb 27 13:56:56 vpn-clonada wag[133]: 2025/02/27 13:56:56 Stopped MFA portal
Feb 27 13:56:56 vpn-clonada wag[133]: 2025/02/27 13:56:56 Stopped MFA portal
Feb 27 13:56:56 vpn-clonada wag[133]: 2025/02/27 13:56:56 Tear down complete
Feb 27 13:56:56 vpn-clonada wag[133]: 2025/02/27 13:56:56 etcd server was instructed to terminate
Feb 27 13:56:56 vpn-clonada wag[133]: {"level":"info","ts":"2025-02-27T13:56:56.410868+0100","caller":"embed/etcd.go:377","msg":"closing etcd server","name":"default","data-dir":"default.wag-node.etcd","advertise-peer-urls":["https://localhost:2380"],"advertise-client-urls":["http://localhost:2379"]}
Feb 27 13:56:56 vpn-clonada wag[133]: {"level":"warn","ts":"2025-02-27T13:56:56.417179+0100","caller":"embed/serve.go:160","msg":"stopping insecure grpc server due to error","error":"accept unix /tmp/wag.etcd.c381c917564213849e1a: use of closed network connection"}
Feb 27 13:56:56 vpn-clonada wag[133]: {"level":"warn","ts":"2025-02-27T13:56:56.422663+0100","caller":"embed/serve.go:162","msg":"stopped insecure grpc server due to error","error":"accept unix /tmp/wag.etcd.c381c917564213849e1a: use of closed network connection"}
Feb 27 13:56:56 vpn-clonada wag[133]: {"level":"info","ts":"2025-02-27T13:56:56.423756+0100","caller":"etcdserver/server.go:1521","msg":"skipped leadership transfer for single voting member cluster","local-member-id":"229e0fb671e145bf","current-leader-member-id":"229e0fb671e145bf"}
Feb 27 13:56:56 vpn-clonada wag[133]: {"level":"info","ts":"2025-02-27T13:56:56.451412+0100","caller":"embed/etcd.go:581","msg":"stopping serving peer traffic","address":"127.0.0.1:2380"}
Feb 27 13:56:56 vpn-clonada wag[133]: {"level":"info","ts":"2025-02-27T13:56:56.451526+0100","caller":"embed/etcd.go:586","msg":"stopped serving peer traffic","address":"127.0.0.1:2380"}
Feb 27 13:56:56 vpn-clonada wag[133]: {"level":"info","ts":"2025-02-27T13:56:56.451540+0100","caller":"embed/etcd.go:379","msg":"closed etcd server","name":"default","data-dir":"default.wag-node.etcd","advertise-peer-urls":["https://localhost:2380"],"advertise-client-urls":["http://localhost:2379"]}
Feb 27 13:56:56 vpn-clonada wag[133]: 2025/02/27 13:56:56 webserver tunnel listener failed: http: Server closed
Feb 27 13:56:56 vpn-clonada systemd[1]: wag.service: Main process exited, code=exited, status=1/FAILURE
Feb 27 13:56:56 vpn-clonada systemd[1]: wag.service: Failed with result 'exit-code'.
Feb 27 13:56:56 vpn-clonada systemd[1]: wag.service: Consumed 10min 16.011s CPU time.
```

Thank you in advance 🐧🌼

Mariaab1 · Feb 27 '25

Hi there!

Thanks for opening this issue. This is something I've seen exactly once and have been unable to replicate, but I know it's a bug. I frankly have no idea why it's occurring or how to stop it, but I'll mark this as a bug so it stays on my mind.

NHAS · Feb 27 '25

If you still have more logs from your service, could you see if all this was preceded by a snapshot?

Also, can you check how big your WAL files are?

NHAS · Feb 27 '25

Ah, I've found part of the problem for your use case at least: there is a cluster test function that effectively runs every 5 seconds and has a 2-second timeout. Due to your extremely slow latencies, that keepalive was tripped, causing the node to reset.

I'll bump it up to something more appropriate.
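
For illustration, here is a minimal sketch of the kind of check described above, assuming an etcd clientv3 `Put` of the `wag/node/<member-id>/ping` key that appears in the trace lines; the function names, loop shape, and endpoint are assumptions, not wag's actual code:

```go
// Illustrative only: a liveness loop of this shape fails exactly as seen in
// the logs ("unable to write liveness value" followed by the node entering
// the dead state) whenever a single write takes longer than its timeout.
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// runLivenessLoop writes a ping key every 5 seconds with a 2-second deadline;
// any failed write is treated as the node having died.
func runLivenessLoop(cli *clientv3.Client, nodeID string) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		// Key layout taken from the trace lines above: wag/node/<member-id>/ping.
		_, err := cli.Put(ctx, "wag/node/"+nodeID+"/ping", time.Now().Format(time.RFC3339))
		cancel()
		if err != nil {
			log.Println("unable to write liveness value")
			log.Println("node entered state: dead")
			return // in wag this is where teardown would begin
		}
	}
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 2 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
	runLivenessLoop(cli, "229e0fb671e145bf") // member ID taken from the logs
}
```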

NHAS · Feb 27 '25

Hello! Sorry for the wait.

Yes, I do have logs. Actually, it happened again; here are the latest ones:

```
Mar 01 09:26:05 vpn-clonada wag[133]: {"level":"info","ts":"2025-03-01T09:26:05.071269+0100","caller":"traceutil/trace.go:171","msg":"trace[1138984415] put","detail":"{key:wag/node/229e0fb671e145bf/ping; req_size:65; response_revision:376325; }","duration":"262.133301ms","start":"2025-03-01T09:26:04.809117+0100","end":"2025-03-01T09:26:05.071250+0100","steps":["trace[1138984415] 'process raft request'  (duration: 262.064413ms)"],"step_count":1}
Mar 01 09:26:29 vpn-clonada wag[133]: {"level":"info","ts":"2025-03-01T09:26:29.964082+0100","caller":"traceutil/trace.go:171","msg":"trace[1950239778] put","detail":"{key:wag/node/229e0fb671e145bf/ping; req_size:65; response_revision:376330; }","duration":"154.573795ms","start":"2025-03-01T09:26:29.809489+0100","end":"2025-03-01T09:26:29.964063+0100","steps":["trace[1950239778] 'process raft request'  (duration: 154.497668ms)"],"step_count":1}
Mar 01 10:01:35 vpn-clonada wag[133]: {"level":"info","ts":"2025-03-01T10:01:35.084556+0100","caller":"v3compactor/periodic.go:134","msg":"starting auto periodic compaction","revision":375959,"compact-period":"1h0m0s"}
Mar 01 10:01:35 vpn-clonada wag[133]: {"level":"info","ts":"2025-03-01T10:01:35.101618+0100","caller":"v3compactor/periodic.go:142","msg":"completed auto periodic compaction","revision":375959,"compact-period":"1h0m0s","took":"17.009786ms"}
Mar 01 10:01:35 vpn-clonada wag[133]: {"level":"info","ts":"2025-03-01T10:01:35.101667+0100","caller":"mvcc/index.go:214","msg":"compact tree index","revision":375959}
Mar 01 10:01:35 vpn-clonada wag[133]: {"level":"info","ts":"2025-03-01T10:01:35.113811+0100","caller":"mvcc/kvstore_compaction.go:69","msg":"finished scheduled compaction","compact-revision":375959,"took":"12.070054ms","hash":3760640311,"current-db-size-bytes":368640,"current-db-size":"369 kB","current-db-size-in-use-bytes":266240,"current-db-size-in-use":"266 kB"}
Mar 01 10:01:35 vpn-clonada wag[133]: {"level":"info","ts":"2025-03-01T10:01:35.114097+0100","caller":"mvcc/hash.go:137","msg":"storing new hash","hash":3760640311,"revision":375959,"compact-revision":375239}
Mar 01 10:02:35 vpn-clonada wag[133]: {"level":"info","ts":"2025-03-01T10:02:35.306170+0100","caller":"traceutil/trace.go:171","msg":"trace[1960669249] linearizableReadLoop","detail":"{readStateIndex:377632; appliedIndex:377631; }","duration":"142.23302ms","start":"2025-03-01T10:02:35.163920+0100","end":"2025-03-01T10:02:35.306153+0100","steps":["trace[1960669249] 'read index received'  (duration: 142.141966ms)","trace[1960669249] 'applied index is now lower than readState.Index'  (duration: 90.339µs)"],"step_count":2}
Mar 01 10:02:35 vpn-clonada wag[133]: {"level":"warn","ts":"2025-03-01T10:02:35.306304+0100","caller":"etcdserver/util.go:170","msg":"apply request took too long","took":"142.364657ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"devices-\" range_end:\"devices.\" sort_order:DESCEND ","response":"range_response_count:6 size:2879"}
Mar 01 10:02:35 vpn-clonada wag[133]: {"level":"info","ts":"2025-03-01T10:02:35.306336+0100","caller":"traceutil/trace.go:171","msg":"trace[2144930215] range","detail":"{range_begin:devices-; range_end:devices.; response_count:6; response_revision:376763; }","duration":"142.435487ms","start":"2025-03-01T10:02:35.163893+0100","end":"2025-03-01T10:02:35.306329+0100","steps":["trace[2144930215] 'agreement among raft nodes before linearized reading'  (duration: 142.32748ms)"],"step_count":1}
Mar 01 10:02:35 vpn-clonada wag[133]: {"level":"info","ts":"2025-03-01T10:02:35.306426+0100","caller":"traceutil/trace.go:171","msg":"trace[999749114] put","detail":"{key:wag/node/229e0fb671e145bf/ping; req_size:65; response_revision:376763; }","duration":"497.481327ms","start":"2025-03-01T10:02:34.808934+0100","end":"2025-03-01T10:02:35.306415+0100","steps":["trace[999749114] 'process raft request'  (duration: 497.163701ms)"],"step_count":1}
Mar 01 10:02:35 vpn-clonada wag[133]: {"level":"warn","ts":"2025-03-01T10:02:35.306467+0100","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2025-03-01T10:02:34.808920+0100","time spent":"497.531813ms","remote":"@","response type":"/etcdserverpb.KV/Put","request count":1,"request size":65,"response count":0,"response size":27,"request content":"key:\"wag/node/229e0fb671e145bf/ping\" value_size:31 "}
Mar 01 10:47:10 vpn-clonada wag[133]: {"level":"info","ts":"2025-03-01T10:47:10.106181+0100","caller":"traceutil/trace.go:171","msg":"trace[486330193] linearizableReadLoop","detail":"{readStateIndex:378167; appliedIndex:378166; }","duration":"106.56496ms","start":"2025-03-01T10:47:09.999600+0100","end":"2025-03-01T10:47:10.106165+0100","steps":["trace[486330193] 'read index received'  (duration: 106.455065ms)","trace[486330193] 'applied index is now lower than readState.Index'  (duration: 109.163µs)"],"step_count":2}
Mar 01 10:47:10 vpn-clonada wag[133]: {"level":"warn","ts":"2025-03-01T10:47:10.106328+0100","caller":"etcdserver/util.go:170","msg":"apply request took too long","took":"106.706989ms","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"devices-\" range_end:\"devices.\" sort_order:DESCEND ","response":"range_response_count:6 size:2879"}
Mar 01 10:47:10 vpn-clonada wag[133]: {"level":"info","ts":"2025-03-01T10:47:10.106358+0100","caller":"traceutil/trace.go:171","msg":"trace[648856435] range","detail":"{range_begin:devices-; range_end:devices.; response_count:6; response_revision:377298; }","duration":"106.7698ms","start":"2025-03-01T10:47:09.999579+0100","end":"2025-03-01T10:47:10.106348+0100","steps":["trace[648856435] 'agreement among raft nodes before linearized reading'  (duration: 106.662193ms)"],"step_count":1}
Mar 01 10:47:10 vpn-clonada wag[133]: {"level":"info","ts":"2025-03-01T10:47:10.106463+0100","caller":"traceutil/trace.go:171","msg":"trace[636348363] put","detail":"{key:wag/node/229e0fb671e145bf/ping; req_size:65; response_revision:377298; }","duration":"297.680285ms","start":"2025-03-01T10:47:09.808773+0100","end":"2025-03-01T10:47:10.106453+0100","steps":["trace[636348363] 'process raft request'  (duration: 297.330315ms)"],"step_count":1}
Mar 01 10:52:50 vpn-clonada wag[133]: {"level":"warn","ts":"2025-03-01T10:52:50.458016+0100","caller":"etcdserver/v3_server.go:920","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":5025899846248398903,"retry-timeout":"500ms"}
Mar 01 10:52:50 vpn-clonada wag[133]: {"level":"warn","ts":"2025-03-01T10:52:50.958726+0100","caller":"etcdserver/v3_server.go:920","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":5025899846248398903,"retry-timeout":"500ms"}
Mar 01 10:52:51 vpn-clonada wag[133]: {"level":"warn","ts":"2025-03-01T10:52:51.458962+0100","caller":"etcdserver/v3_server.go:920","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":5025899846248398903,"retry-timeout":"500ms"}
Mar 01 10:52:51 vpn-clonada wag[133]: {"level":"warn","ts":"2025-03-01T10:52:51.809343+0100","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2025-03-01T10:52:49.809236+0100","time spent":"2.000078398s","remote":"@","response type":"/etcdserverpb.KV/Put","request count":1,"request size":65,"response count":0,"response size":0,"request content":"key:\"wag/node/229e0fb671e145bf/ping\" value_size:31 "}
Mar 01 10:52:52 vpn-clonada wag[133]: {"level":"warn","ts":"2025-03-01T10:52:52.079444+0100","caller":"etcdserver/v3_server.go:920","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":5025899846248398903,"retry-timeout":"500ms"}
Mar 01 10:52:52 vpn-clonada wag[133]: {"level":"warn","ts":"2025-03-01T10:52:52.079526+0100","caller":"wal/wal.go:805","msg":"slow fdatasync","took":"2.270189424s","expected-duration":"1s"}
Mar 01 10:52:52 vpn-clonada wag[133]: {"level":"warn","ts":"2025-03-01T10:52:51.809449+0100","logger":"etcd-client","caller":"[email protected]/retry_interceptor.go:63","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002d01e0//tmp/wag.etcd.cc78728e7fc429efedda","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Mar 01 10:52:52 vpn-clonada wag[133]: 2025/03/01 10:52:52 unable to write liveness value
Mar 01 10:52:52 vpn-clonada wag[133]: 2025/03/01 10:52:52 node entered state:  dead
Mar 01 10:52:52 vpn-clonada wag[133]: 2025/03/01 10:52:52 Tearing down node
Mar 01 10:52:52 vpn-clonada wag[133]: 2025/03/01 10:52:52 Removing wireguard device
Mar 01 10:52:52 vpn-clonada wag[133]: {"level":"info","ts":"2025-03-01T10:52:52.106195+0100","caller":"traceutil/trace.go:171","msg":"trace[391705599] linearizableReadLoop","detail":"{readStateIndex:378235; appliedIndex:378234; }","duration":"2.148658419s","start":"2025-03-01T10:52:49.957522+0100","end":"2025-03-01T10:52:52.106181+0100","steps":["trace[391705599] 'read index received'  (duration: 2.122280853s)","trace[391705599] 'applied index is now lower than readState.Index'  (duration: 26.376909ms)"],"step_count":2}
Mar 01 10:52:52 vpn-clonada wag[133]: {"level":"warn","ts":"2025-03-01T10:52:52.106309+0100","caller":"etcdserver/util.go:170","msg":"apply request took too long","took":"2.148765606s","expected-duration":"100ms","prefix":"read-only range ","request":"key:\"devices-\" range_end:\"devices.\" sort_order:DESCEND ","response":"range_response_count:6 size:2879"}
Mar 01 10:52:52 vpn-clonada wag[133]: {"level":"info","ts":"2025-03-01T10:52:52.106333+0100","caller":"traceutil/trace.go:171","msg":"trace[284221767] range","detail":"{range_begin:devices-; range_end:devices.; response_count:6; response_revision:377366; }","duration":"2.148825374s","start":"2025-03-01T10:52:49.957501+0100","end":"2025-03-01T10:52:52.106326+0100","steps":["trace[284221767] 'agreement among raft nodes before linearized reading'  (duration: 2.148727473s)"],"step_count":1}
Mar 01 10:52:52 vpn-clonada wag[133]: {"level":"warn","ts":"2025-03-01T10:52:52.106359+0100","caller":"v3rpc/interceptor.go:197","msg":"request stats","start time":"2025-03-01T10:52:49.957485+0100","time spent":"2.148865758s","remote":"@","response type":"/etcdserverpb.KV/Range","request count":0,"request size":22,"response count":6,"response size":2900,"request content":"key:\"devices-\" range_end:\"devices.\" sort_order:DESCEND "}
Mar 01 10:52:52 vpn-clonada wag[133]: 2025/03/01 10:52:52 Wireguard device removed
Mar 01 10:52:52 vpn-clonada wag[133]: 2025/03/01 10:52:52 Removing Firewall rules...
Mar 01 10:52:52 vpn-clonada wag[133]: 2025/03/01 10:52:52 Removing Firewall rules...
Mar 01 10:52:52 vpn-clonada wag[133]: 2025/03/01 10:52:52 Firewall rules removed.
Mar 01 10:52:52 vpn-clonada wag[133]: 2025/03/01 10:52:52 Stopped Management UI
Mar 01 10:52:52 vpn-clonada wag[133]: 2025/03/01 10:52:52 Removing wireguard device
Mar 01 10:52:52 vpn-clonada wag[133]: 2025/03/01 10:52:52 Unable to remove wireguard device, delete failed:  failed to execute message: netlink receive: no such device
Mar 01 10:52:52 vpn-clonada wag[133]: 2025/03/01 10:52:52 remove /tmp/wag.sock: no such file or directory
Mar 01 10:52:52 vpn-clonada wag[133]: 2025/03/01 10:52:52 Stopped Management UI
Mar 01 10:52:52 vpn-clonada wag[133]: 2025/03/01 10:52:52 Creating redirection from 80/tcp to TLS webserver public listener failed: http: Server closed
Mar 01 10:52:52 vpn-clonada wag[133]: 2025/03/01 10:52:52 Stopped MFA portal
Mar 01 10:52:52 vpn-clonada wag[133]: 2025/03/01 10:52:52 Stopped MFA portal
Mar 01 10:52:52 vpn-clonada wag[133]: 2025/03/01 10:52:52 Tear down complete
Mar 01 10:52:52 vpn-clonada wag[133]: 2025/03/01 10:52:52 etcd server was instructed to terminate
Mar 01 10:52:52 vpn-clonada wag[133]: {"level":"info","ts":"2025-03-01T10:52:52.497098+0100","caller":"embed/etcd.go:377","msg":"closing etcd server","name":"default","data-dir":"default.wag-node.etcd","advertise-peer-urls":["https://localhost:2380"],"advertise-client-urls":["http://localhost:2379"]}
Mar 01 10:52:52 vpn-clonada wag[133]: {"level":"warn","ts":"2025-03-01T10:52:52.500361+0100","caller":"embed/serve.go:160","msg":"stopping insecure grpc server due to error","error":"accept unix /tmp/wag.etcd.cc78728e7fc429efedda: use of closed network connection"}
Mar 01 10:52:52 vpn-clonada wag[133]: {"level":"warn","ts":"2025-03-01T10:52:52.502808+0100","caller":"embed/serve.go:162","msg":"stopped insecure grpc server due to error","error":"accept unix /tmp/wag.etcd.cc78728e7fc429efedda: use of closed network connection"}
Mar 01 10:52:52 vpn-clonada wag[133]: {"level":"info","ts":"2025-03-01T10:52:52.502983+0100","caller":"etcdserver/server.go:1521","msg":"skipped leadership transfer for single voting member cluster","local-member-id":"229e0fb671e145bf","current-leader-member-id":"229e0fb671e145bf"}
Mar 01 10:52:52 vpn-clonada wag[133]: {"level":"info","ts":"2025-03-01T10:52:52.523984+0100","caller":"embed/etcd.go:581","msg":"stopping serving peer traffic","address":"127.0.0.1:2380"}
Mar 01 10:52:52 vpn-clonada wag[133]: {"level":"info","ts":"2025-03-01T10:52:52.524130+0100","caller":"embed/etcd.go:586","msg":"stopped serving peer traffic","address":"127.0.0.1:2380"}
Mar 01 10:52:52 vpn-clonada wag[133]: {"level":"info","ts":"2025-03-01T10:52:52.524154+0100","caller":"embed/etcd.go:379","msg":"closed etcd server","name":"default","data-dir":"default.wag-node.etcd","advertise-peer-urls":["https://localhost:2380"],"advertise-client-urls":["http://localhost:2379"]}
Mar 01 10:52:52 vpn-clonada wag[133]: 2025/03/01 10:52:52 webserver tunnel listener failed: http: Server closed
Mar 01 10:52:52 vpn-clonada systemd[1]: wag.service: Main process exited, code=exited, status=1/FAILURE
Mar 01 10:52:52 vpn-clonada systemd[1]: wag.service: Failed with result 'exit-code'.
Mar 01 10:52:52 vpn-clonada systemd[1]: wag.service: Consumed 9min 31.598s CPU time.
```

Regarding your question:

  • If you mean a snapshot of the server (it's a VM), no, there hasn't been one 🤔
  • If you're referring to something in the logs where the word "snapshot" appears:

```
journalctl -u wag --since "2025-03-01 00:00:00" --no-pager | grep -i snapshot

Mar 03 08:26:50 vpn-clonada wag[316222]: {"level":"info","ts":"2025-03-03T08:26:50.436858+0100","caller":"embed/etcd.go:310","msg":"starting an etcd server","etcd-version":"3.5.15","git-sha":"Not provided (use ./build instead of go build)","go-version":"go1.23.2","go-os":"linux","go-arch":"amd64","max-cpu-set":1,"max-cpu-available":1,"member-initialized":true,"name":"default","data-dir":"default.wag-node.etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"default.wag-node.etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":100000,"max-wals":5,"max-snapshots":5,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["https://localhost:2380"],"listen-peer-urls":["https://localhost:2380"],"advertise-client-urls":["http://localhost:2379"],"listen-client-urls":["unix:///tmp/wag.etcd.e4cecf01638be9c35401"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"","initial-cluster-state":"new","initial-cluster-token":"","quota-backend-bytes":2147483648,"max-request-bytes":1572864,"max-concurrent-streams":4294967295,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","compact-check-time-enabled":false,"compact-check-time-interval":"1m0s","auto-compaction-mode":"periodic","auto-compaction-retention":"1h0m0s","auto-compaction-interval":"1h0m0s","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"}
Mar 03 08:26:51 vpn-clonada wag[316222]: {"level":"info","ts":"2025-03-03T08:26:51.017270+0100","caller":"etcdserver/server.go:511","msg":"recovered v2 store from snapshot","snapshot-index":300003,"snapshot-size":"12 kB"}
Mar 03 08:26:51 vpn-clonada wag[316222]: {"level":"info","ts":"2025-03-03T08:26:51.019428+0100","caller":"etcdserver/server.go:524","msg":"recovered v3 backend from snapshot","backend-size-bytes":368640,"backend-size":"369 kB","backend-size-in-use-bytes":225280,"backend-size-in-use":"225 kB"}
root@vpn-clonada:~#
```

Note that although this output is from today, the service remained off from the 1st until today.

Finally, here is the size of the WAL files:

[screenshot: WAL file sizes]

Let me know if you need anything else :)

Mariaab1 · Mar 03 '25

Ah sweet, yeah: you can see that just before the issue hits, etcd runs a compaction and then for some reason takes >2 seconds to serve the request. I wonder why that's happening. Does this happen every hour, by any chance? (The periodic compaction happens hourly.)
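
Since the warnings above point at slow fdatasync (etcd expects WAL syncs to finish within 1s), one way to check whether the disk itself is the bottleneck is a small standalone probe. This is not part of wag, and the target directory is an assumption; point it at the same filesystem that holds default.wag-node.etcd:

```go
// fsyncprobe: time a handful of small synced writes, the same pattern etcd's
// WAL uses. Durations approaching or exceeding 1s reproduce the "slow
// fdatasync" warning seen in the logs and suggest a disk I/O problem.
package main

import (
	"fmt"
	"log"
	"os"
	"time"

	"golang.org/x/sys/unix" // Fdatasync is Linux-specific, like the host here
)

func main() {
	// Assumed path: any directory on the same filesystem as the etcd data dir.
	f, err := os.CreateTemp("default.wag-node.etcd", "fsync-probe-*")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	buf := make([]byte, 2048) // roughly the size of a small WAL entry
	for i := 0; i < 10; i++ {
		if _, err := f.Write(buf); err != nil {
			log.Fatal(err)
		}
		start := time.Now()
		if err := unix.Fdatasync(int(f.Fd())); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("fdatasync %d took %v\n", i, time.Since(start))
	}
}
```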

NHAS · Mar 03 '25

Unfortunately, to get this fixed I am going to need to complete work on my client UI changes, which may take a while. Apologies!

NHAS · Mar 03 '25

Oh, okay, no worries 😌👌

And no, it doesn't happen every hour. It was working fine for a few weeks, and it has happened three times:

The first time was on the 24th or 25th (I don't remember exactly).

The second time was on February 27th at 13:56.

The third time was on Saturday, March 1st, at 10:52.

Mariaab1 · Mar 04 '25

Hi there, a new version of wag has been released with an updated version of etcd (3.5 -> 3.6). I am not sure whether this will fix your issue, but the timeout for cluster health checking has been raised to 15 seconds for writes. This still indicates that your system might have issues with disk I/O. For now I am closing this issue.
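
Relative to the keepalive sketch earlier in the thread, the change described here amounts to giving the liveness write a much longer deadline before the node declares itself dead. The exact wag internals are not shown; this purely illustrative fragment reuses the imports and names from that earlier sketch:

```go
// writeLiveness gives the cluster health write a 15-second window (previously
// ~2 seconds), so a single slow fdatasync no longer kills the node outright.
func writeLiveness(ctx context.Context, cli *clientv3.Client, nodeID string) error {
	ctx, cancel := context.WithTimeout(ctx, 15*time.Second)
	defer cancel()
	_, err := cli.Put(ctx, "wag/node/"+nodeID+"/ping", time.Now().Format(time.RFC3339))
	return err
}
```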

NHAS · Jun 15 '25