
Sensu Go WebUI Randomly Crashing

Open tgdfool2 opened this issue 4 years ago • 28 comments

After some (random) amount of time, the Sensu Go WebUI crashes and stays down until the entire sensu-backend service is restarted.

(As follow-up for https://github.com/sensu/sensu-go/issues/4070)

Expected Behavior

Sensu Go WebUI should be reliable and not randomly crash.

Current Behavior

Sensu Go WebUI crashes; no process is listening on TCP port 3000 anymore.

Steps to Reproduce (for bugs)

  1. Start sensu-backend: systemctl start sensu-backend
  2. Try to access the WebUI: success :-)
  3. Wait a bit (typically 1-2 hours)
  4. Try to access the WebUI again: failure :-(

Context

Please find attached to this issue a log file containing entries corresponding to the time the WebUI crashed on my system.

Your Environment

  • Sensu version used (sensuctl, sensu-backend, and/or sensu-agent):
ii  sensu-go-agent                6.2.0-3888                        amd64        Sensu Go Agent
ii  sensu-go-backend              6.2.0-3888                        amd64        Sensu Go Backend
  • Installation method (packages, binaries, docker etc.): Packages (deb)
  • Operating System and version (e.g. Ubuntu 14.04): Ubuntu 20.04
  • Virtualization type: Linux Container running on Proxmox (uname -a: Linux sensu-test 5.4.78-2-pve #1 SMP PVE 5.4.78-2 (Thu, 03 Dec 2020 14:26:17 +0100) x86_64 x86_64 x86_64 GNU/Linux)

If you need additional information, feel free to ask :-)

tgdfool2 avatar Dec 18 '20 19:12 tgdfool2

sensu-backend.log

tgdfool2 avatar Dec 18 '20 19:12 tgdfool2

@tgdfool2 It appears from the logs that your cluster/node may be under very heavy load, for instance:

Dec 18 18:12:29 sensu-test sensu-backend[11640]: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/rings/metricsd/backends/items/ab5d87e0-45f9-42fd-a4b6-dfa2027164ad\\\" \" with result \"range_response_count:1 size:108\" took too long (10.004053999s) to execute","pkg":"etcdserver","time":"2020-12-18T18:12:29+01:00"}

It appears as if it took 10 seconds to run a range request. I suspect this may be causing your backend to descend into an unending boot loop.

Could you provide more details on your deployment? Are you using embedded etcd? Which datastore are you using, etcd or PostgreSQL?
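If it helps, you can sanity-check etcd's latency by pointing an etcd v3 etcdctl at the client port (the embedded etcd listens on 127.0.0.1:2379 by default; this assumes you have an etcdctl binary installed separately):

# adjust --endpoints if your embedded etcd listens elsewhere
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 endpoint status -w table
ETCDCTL_API=3 etcdctl --endpoints=http://127.0.0.1:2379 check perf

endpoint status shows the member and DB size, and check perf runs a short load test that will flag storage that is too slow for etcd's defaults.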

jamesdphillips avatar Dec 19 '20 00:12 jamesdphillips

Thanks for your answer @jamesdphillips.

Sensu Go is currently running with an embedded etcd:

root@sensu-test:~# ss -tulnp | grep sensu
udp     UNCONN   0        0              127.0.0.1:8125          0.0.0.0:*       users:(("sensu-agent",pid=152,fd=10))
udp     UNCONN   0        0              127.0.0.1:3030          0.0.0.0:*       users:(("sensu-agent",pid=152,fd=8))
tcp     LISTEN   0        4096           127.0.0.1:2379          0.0.0.0:*       users:(("sensu-backend",pid=11640,fd=6))
tcp     LISTEN   0        4096           127.0.0.1:2380          0.0.0.0:*       users:(("sensu-backend",pid=11640,fd=5))
tcp     LISTEN   0        4096           127.0.0.1:3030          0.0.0.0:*       users:(("sensu-agent",pid=152,fd=9))
tcp     LISTEN   0        4096           127.0.0.1:3031          0.0.0.0:*       users:(("sensu-agent",pid=152,fd=11))
tcp     LISTEN   0        4096                   *:8080                *:*       users:(("sensu-backend",pid=11640,fd=17))
tcp     LISTEN   0        4096                   *:8081                *:*       users:(("sensu-backend",pid=11640,fd=16))

This is a test server dedicated to Sensu Go, running with 4 GB of RAM:

root@sensu-test:~# free -m
              total        used        free      shared  buff/cache   available
Mem:           4096         205        3255           0         635        3890
Swap:           512           0         511

I would be surprised if the etcd cluster was under heavy load, as it is currently only used for the Sensu Go deployment, and only one agent is registered (locally installed on the server).

tgdfool2 avatar Dec 19 '20 07:12 tgdfool2

Please find attached yesterday's entire syslog, filtered on "etcd": etcd.log

tgdfool2 avatar Dec 19 '20 07:12 tgdfool2

Hi, I'm experiencing the same behavior on a single-node Docker installation (tested with 6.1.3 and 6.2). This issue started almost 20 days ago. No big changes have been made in the last month to the checks (approx. 20), entities (4) and such.

When the WebUI goes down, one second before, I can see messages in the docker logs almost identical to the one posted by @jamesdphillips, saying etcd took too long to execute. No other warning or error message appears. Moreover, when the WebUI goes down, the WebSocket connections are also lost, so check schedules run (I see them in the backend logs) but agents get disconnected (I see it in the agents' logs) and don't actually run the checks.

The only way I can recover the situation is to restart the docker container. I also tried destroying the container storage to reset the deployment but, after a few minutes, the problem appeared again. The frequency is also increasing; now it lasts at most 10 minutes before crashing.

The docker container runs on a 4-core, 4 GB Ubuntu 16.04 machine. When the WebUI crashes I see the error: "Something went wrong. The application has encountered an unrecoverable error. Based on the nature of the error you may need to clear state before continuing. 502 web revision 7571e9d"

Please advise or let me know if I can produce any useful info. Thanks

neuwolferl avatar Jan 12 '21 22:01 neuwolferl

Hi @neuwolferl,

Looking at @cwjohnston's mention above, it looks like our etcd cluster (and maybe yours as well) is running on disks that are not performant enough, which is why the WebUI randomly crashes. Unfortunately, I don't have any other hardware to try it on...

Maybe @cwjohnston could give some more information here?

Thanks!

tgdfool2 avatar Jan 13 '21 06:01 tgdfool2

Hi, a little update. This morning I tried to downgrade the backend, keeping the exact same configuration and hardware, changing only the docker tag to 5.21.3 and dropping the docker volume before restart. I still see log messages like

2021-01-13T08:57:31.426823146Z {"component":"etcd","level":"warning","msg":"read-only range request "key:\"/sensu.io/assets/default/\" range_end:\"/sensu.io/assets/default0\" serializable:true " with result "range_response_count:0 size:5" took too long (169.29492ms) to execute","pkg":"etcdserver","time":"2021-01-13T09:57:31+01:00"}

happening randomly roughly once every 15 seconds, but the WebUI does not crash anymore.

Over roughly half an hour the WebUI gave me an error twice. In this case, some more data were present:

"An error was encountered. Please submit an issue to help resolve this problem. Reloading the page may be required to recover.
ExtendableError@https://[...OMISSIS...]/static/js/app_9361.js:2:1773927
_createSuperInternal@https://[...OMISSIS...]/static/js/app_9361.js:2:11304
FetchError@https://[...OMISSIS...]/static/js/app_9361.js:2:12101
_createSuperInternal@https://[...OMISSIS...]/static/js/app_9361.js:2:11304
ServerError@https://[...OMISSIS...]/static/js/app_9361.js:2:12697
customFetch/<@https://[...OMISSIS...]static/js/app_9361.js:2:1662924
web revision b6d7442"

After pressing reload I didn't get the 502 as with version 6.*; I was taken back to the WebUI.

neuwolferl avatar Jan 13 '21 09:01 neuwolferl

I have exactly the same issue as described here; the only log entry is about etcd taking too long.

Feb 12 09:55:36 sensu-go sensu-backend[1484]: {"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/rings/default/<blablabla>" \" with result \"range_response_count:1 size:89\" took too long (123.464996ms) to execute","pkg":"etcdserver","time":"2021-02-12T09:55:36+01:00"}

So I've switched to PostgreSQL:

Feb 12 09:50:04 sensu-go sensu-backend[1484]: {"component":"store-providers","level":"warning","msg":"trying to enable postgres event store","time":"2021-02-12T09:50:04+01:00"}
Feb 12 09:50:04 sensu-go sensu-backend[1484]: {"component":"store-providers","level":"warning","msg":"switched event store to postgres","time":"2021-02-12T09:50:04+01:00"}

And the issue still persists. It feels like it was running longer (2 days), but I can't tell for sure.

I have compiled the open-source GUI version (https://github.com/sensu/web/) and it works fine, even when the embedded one has already crashed. So the API is still working.

I'm not entirely sure this issue is caused by etcd performance; we have only around 70 hosts and not many checks on them, usually scheduled every 1-10 minutes, so there is no heavy load.

We can live without the commercial features in the GUI, but we can't rely on monitoring that is "somehow half broken".

VeselaHouba avatar Feb 12 '21 08:02 VeselaHouba

It is funny. I was just woken up by an alert about our sensu-ui going down, and I am seeing the same issue with etcd in the logs. It happened at 3:04am CST. @VeselaHouba What time zone are you in? I find it funny you just posted 25 minutes ago about this.

We have 54 hosts and 230 events.

Of course simply restarting the backend with systemctl restart sensu-backend.service fixed it.

Versions:
ii sensu-go-backend 6.2.3-3986 amd64 Sensu Go Backend
Ubuntu 20.04.2 LTS

edgan avatar Feb 12 '21 09:02 edgan

@edgan CET (+01:00). Coincidence, I guess; this has been happening a few times a day for at least a month. Fortunately, Slack notifications from Sensu still work, so we're not running completely blind.

VeselaHouba avatar Feb 12 '21 09:02 VeselaHouba

I just ran apt-get dist-upgrade and got the version below, which is interesting, since I don't see anything in the changelog for anything past 6.2.3. 6.2.3 is the last official release, but 6.2.4 and 6.2.5 have tags.

ii sensu-go-backend 6.2.5-4040 amd64 Sensu Go Backend

Later today I am going to try switching the instance from gp2 to io2. I will also have to give more thought to gp3 going forward. io2 Block Express is in preview and seems to be the pinnacle.

https://aws.amazon.com/ebs/features/

edgan avatar Feb 12 '21 09:02 edgan

I've been running 6.2.5 for a few hours and it has already crashed. Running on a Hetzner SSD VM with 4 GB RAM, ~1.5 GB used, system load under 0.3. I have also been running the standalone open-source version of the web UI for a few days now with no issues, which would support the idea that the problem is only with the embedded web UI component.

VeselaHouba avatar Feb 12 '21 22:02 VeselaHouba

I'm seeing the exact same issues. Opened a thread on discourse about it: https://discourse.sensu.io/t/sensu-ui-going-offline/2451

wduncanfraser avatar Feb 17 '21 21:02 wduncanfraser

This issue has been mentioned on Sensu Community. There might be relevant details there:

https://discourse.sensu.io/t/sensu-ui-going-offline/2451/2

sensu-discourse avatar Feb 17 '21 21:02 sensu-discourse

Attaching logs from debugging: I added monitoring of the UI HTTP interface, and two of the servers dropped last night, one at 0700 UTC and the other at 0739 UTC.

Logs for the crash at 0700:
$ sudo journalctl --unit sensu-backend --since "2021-02-18 06:30" --until "2021-02-18 07:30" | cat
https://gist.github.com/wduncanfraser/c55691024df1bf5d2e6d485790fd766a#file-0700_crash-txt

Logs for the crash at 0739:
$ sudo journalctl --unit sensu-backend --since "2021-02-18 07:00" --until "2021-02-18 08:00" | cat
https://gist.github.com/wduncanfraser/c55691024df1bf5d2e6d485790fd766a#file-0739_crash-txt

wduncanfraser avatar Feb 19 '21 03:02 wduncanfraser

Hi folks, I wanted to let you all know that we've found an issue that is likely linked to what you've been experiencing.

The issue is in the commercial distribution of the software and doesn't impact the OSS project, so unfortunately I don't have an issue to reference here. I'll try to explain what can sometimes happen.

When Sensu's etcd client times out after several retries, it propagates an error to the various services that sensu-backend provides. Since the error is unrecoverable, the backend tells all of the services to stop and start again with a new etcd client. This happens whether etcd is running embedded within sensu-backend or externally, with sensu-backend as a client only (--no-embed-etcd).

Unfortunately, we found a bug in one of the services that causes it to hang on shutdown. This leads to sensu-backend running only partially, and being mostly broken. The most obvious impact is the web UI no longer running, but other services will have failed to start up as well.
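To illustrate the failure mode with a minimal, hypothetical Go sketch (this is not sensu-backend's actual code): if one service's stop routine blocks forever, a sequential stop-and-restart loop never reaches the restart step, so everything behind it, including the web UI listener, stays down.

package main

import (
	"fmt"
	"time"
)

// service is a stand-in for one of the backend's internal services.
type service struct {
	name string
	stop func() // blocks until the service has fully stopped
}

// restartAll stops every service in order and then (in a real backend) would start them again.
func restartAll(services []service) {
	for _, s := range services {
		fmt.Println("stopping", s.name)
		s.stop() // if this hangs, nothing below ever runs
		fmt.Println("stopped", s.name)
	}
	fmt.Println("restarting all services") // never reached if a stop hangs
}

func main() {
	services := []service{
		{name: "healthy-service", stop: func() {}},
		{name: "buggy-service", stop: func() { select {} }}, // hangs forever on shutdown
		{name: "web-ui", stop: func() {}},
	}

	done := make(chan struct{})
	go func() { restartAll(services); close(done) }()

	select {
	case <-done:
		fmt.Println("restart completed")
	case <-time.After(2 * time.Second):
		fmt.Println("restart wedged: backend left partially running")
	}
}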

We've fixed this issue and it will be resolved in 6.3.0, but those of you encountering frequent crashes will still need to do some things to avoid having problems; a non-functioning etcd is going to be an issue even if this bug is resolved.

If you are running etcd on network attached storage, then you need to increase the heartbeat timeout.

Due to its design, etcd is extremely sensitive to I/O latency, and its defaults are written assuming that it will be deployed to locally attached low-latency storage. By default, etcd's heartbeat timeout is only 1 second. This is why etcd issues warnings for reads that take longer than 100 milliseconds.

Even small installations should deploy to locally attached SSDs, but if you really want to deploy to network attached storage, you can improve stability by increasing the heartbeat interval and heartbeat timeout. I would suggest increasing the heartbeat interval to 1 second and the timeout to 10 seconds. Note that even when you increase the heartbeat interval and heartbeat timeout, etcd will still erroneously warn about reads that take longer than 100ms; this can be considered normal.
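With the Sensu backend's embedded etcd, these settings map to etcd-heartbeat-interval and etcd-election-timeout in backend.yml (values in milliseconds, and etcd requires the election timeout to be at least 5x the heartbeat interval). A rough, untested sketch of the 1 second / 10 second suggestion above:

# example values only; tune for your storage
etcd-heartbeat-interval: '1000'
etcd-election-timeout: '10000'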

Increasing the heartbeat timeout does come at a cost. When your etcd cluster experiences an election, it will take longer for service to be restored than with tighter timeouts. So be mindful, when adjusting the timeout, that this number represents the minimum amount of time it will take the cluster to restore service after an election.

These issues can still occur even if you have a single node etcd cluster. A single node etcd cluster can still experience elections.

You can read more about tuning etcd here, including guidelines for setting the heartbeat timeout: https://etcd.io/docs/v3.4.0/tuning/

I genuinely hope this helps those of you that have been struggling with stability. We do hope to reduce our reliance on etcd going forward, as it can be quite finicky about its deployment environment. When deployed and tuned well, it can be quite reliable and resilient, but it can be somewhat challenging!

echlebek avatar Mar 08 '21 19:03 echlebek

This issue has been mentioned on Sensu Community. There might be relevant details there:

https://discourse.sensu.io/t/sensu-webui-offline-restarting-service-solves-the-problem/2352/22

sensu-discourse avatar Mar 09 '21 18:03 sensu-discourse

OK, this is great, but while I'm waiting on the 6.3.0 image, how does one tune etcd for the 6.2.2 docker image using the built-in etcd? There is no command line to add extra options to the etcd invocation, and setting ETCD_HEARTBEAT_INTERVAL & ETCD_ELECTION_TIMEOUT when starting the container has no effect. I'm still getting

{"component":"etcd","level":"warning","msg":"request \"header:\u003cID:16595615202129641869 \u003e compaction:\u003crevision:265665 \u003e \" with result \"size:6\" took too long (331.674246ms) to execute","pkg":"etcdserver","time":"2021-03-09T23:45:02Z"}

with

ETCD_HEARTBEAT_INTERVAL=1250
ETCD_ELECTION_TIMEOUT=5000

andrew-pickin-epi avatar Mar 09 '21 23:03 andrew-pickin-epi

That would be

SENSU_BACKEND_ETCD_HEARTBEAT_INTERVAL=1250
SENSU_BACKEND_ETCD_ELECTION_TIMEOUT=5000
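For example, with the official sensu/sensu image, something like this should work (untested; add your usual port mappings and volumes, and keep the election timeout at least 5x the heartbeat interval, here using the 1s/10s values suggested earlier):

# untested example; adjust ports, volumes, and values to your environment
docker run -d --name sensu-backend \
  -e SENSU_BACKEND_ETCD_HEARTBEAT_INTERVAL=1000 \
  -e SENSU_BACKEND_ETCD_ELECTION_TIMEOUT=10000 \
  sensu/sensu sensu-backend start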

echlebek avatar Mar 09 '21 23:03 echlebek

Thanks for that. That appears to be registering as I get

{"component":"sensu-enterprise","error":"error starting etcd: --election-timeout[5000ms] should be at least as 5 times as --heartbeat-interval[1250ms]","level":"fatal","msg":"error executing sensu-backend","time":"2021-03-10T00:28:53Z"}

Setting the heartbeat down to 1000 gets past this error, but doesn't clear up the warnings.

{"component":"etcd","level":"warning","msg":"read-only range request \"key:\\\"/sensu.io/handlers/default/keepalive\\\" limit:1 \" with result \"range_response_count:0 size:6\" took too long (542.642467ms) to execute","pkg":"etcdserver","time":"2021-03-10T00:31:49Z"}

andrew-pickin-epi avatar Mar 10 '21 00:03 andrew-pickin-epi

etcd will warn about slow reads no matter what you set the election timeout to.

echlebek avatar Mar 10 '21 20:03 echlebek

One of the reasons the system is under stress is the excessive log output.

andrew-pickin-epi avatar Mar 11 '21 10:03 andrew-pickin-epi

That is pretty unlikely unless you're running at debug level, but if you really think so, you can always adjust the log level.

echlebek avatar Mar 11 '21 19:03 echlebek

No, but it contributes to the problem and hampers other diagnostics. I don't want to lower the log level just to remove the warnings, unless there is a way to do this for etcd only.

andrew-pickin-epi avatar Mar 12 '21 09:03 andrew-pickin-epi

My changes in backend.yml

etcd-election-timeout: '25000'
etcd-heartbeat-interval: '5000'
log-level: error

Sensu backend didn't crash for 2 days. 🤞🏻

VeselaHouba avatar Mar 13 '21 10:03 VeselaHouba

@VeselaHouba it's been a few more days now, have you seen better stability from your Sensu cluster after making the changes?

echlebek avatar Mar 17 '21 00:03 echlebek

@echlebek So far so good

root@sensu-go:~# date
Wed Mar 17 08:54:43 CET 2021
root@sensu-go:~# ps -ef|grep sensu
sensu    20304     1 22 Mar11 ?        1-07:22:50 /usr/sbin/sensu-backend start -c /etc/sensu/backend.yml

Don't want to celebrate too soon, but looks like your suggestion helped. Thanks!

VeselaHouba avatar Mar 17 '21 07:03 VeselaHouba

This issue has been mentioned on Sensu Community. There might be relevant details there:

https://discourse.sensu.io/t/sensu-ui-going-offline/2451/10

sensu-discourse avatar Jun 22 '23 21:06 sensu-discourse