
Grafana pod crashes after upgrade or restart

Open Subetov opened this issue 3 years ago • 11 comments

grafana chart version: 6.29.6
k8s version: 1.21.4

Grafana pod crashes after upgrade or restart and never comes back online. Everything is fine with a fresh install, but if I trigger a restart or try to upgrade the helm release, I get this error:

panic: New alert rules created while using unified alerting will be deleted, set force_migration=true in your grafana.ini and try again if this is okay.

goroutine 1 [running]:
github.com/grafana/grafana/pkg/services/sqlstore/migrations/ualert.AddDashAlertMigration(0xc000c974f0)
    /drone/src/pkg/services/sqlstore/migrations/ualert/ualert.go:78 +0x797
github.com/grafana/grafana/pkg/services/sqlstore/migrations.(*OSSMigrations).AddMigration(0xc00028a620, 0xc000c974f0)
    /drone/src/pkg/services/sqlstore/migrations/migrations.go:58 +0x205
github.com/grafana/grafana/pkg/services/sqlstore.(*SQLStore).Migrate(0xc000232300, 0x0)
    /drone/src/pkg/services/sqlstore/sqlstore.go:135 +0x6f
github.com/grafana/grafana/pkg/services/sqlstore.ProvideService(0xc000d00000, 0x18, {0x36e67e0, 0x5752738}, {0x370c300, 0xc000c970e0})
    /drone/src/pkg/services/sqlstore/sqlstore.go:67 +0xdc
github.com/grafana/grafana/pkg/server.Initialize({{0x7ffc83eb2c4d, 0x18}, {0x7ffc83eb2c31, 0x12}, {0xc0001a8040, 0x5, 0x5}}, {{0x0, 0x0}, {0x0, ...}, ...}, ...)
    /drone/src/pkg/server/wire_gen.go:147 +0x1b6
github.com/grafana/grafana/pkg/cmd/grafana-server/commands.executeServer({0x7ffc83eb2c4d, 0x18}, {0x7ffc83eb2c31, 0x12}, {0x0, 0x0}, {0x7ffc83eb2c72, 0x6}, 0x0, {{0x3667e20, ...}, ...})
    /drone/src/pkg/cmd/grafana-server/commands/cli.go:170 +0x625
github.com/grafana/grafana/pkg/cmd/grafana-server/commands.RunServer({{0x3667e20, 0x5}, {0x3669770, 0xa}, {0x3667e18, 0x4}, {0x3669760, 0xa}})
    /drone/src/pkg/cmd/grafana-server/commands/cli.go:107 +0x785
main.main()
    /drone/src/pkg/cmd/grafana-server/main.go:16 +0xc5

Subetov avatar Jun 06 '22 15:06 Subetov

I have noticed the behavior described above in all versions since 6.29.3. What is very strange: if you upgrade sequentially from version 6.29.2, everything is fine. Version 6.29.2 itself works fine (restarting the pod is fine), except for what is described below. And one more strange thing with a fresh 6.29.2 install: at first the Prometheus alerts are there: [screenshot]

But after I restart the pod, the Prometheus alerts disappear: [screenshot]

Chart values:

  enabled: true
  plugins: []
  grafana.ini:
    server:
      domain: ""
      root_url: "%(protocol)s://%(domain)s/grafana"
      serve_from_sub_path: true
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: qqq
          orgId: 1
          type: file
          options:
            path: /var/lib/grafana/dashboards/qqq
  dashboardsConfigMaps:
    qqq: "qqq-dashboards"
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: prometheus
          type: prometheus
          url: "http://{{ .Release.Name }}-prometheus-server"
        - name: loki
          type: loki
          url: "http://{{ .Release.Name }}-loki:3100"
          jsonData:
            manageAlerts: false

Subetov avatar Jun 07 '22 13:06 Subetov

Same issue here!

mindrunner avatar Jun 07 '22 15:06 mindrunner

Got the same problem on 8.5.4

magikfly avatar Jun 08 '22 06:06 magikfly

Same here

tbard1 avatar Jun 08 '22 07:06 tbard1

Downgrading to 8.2.6 seems to have fixed it by resetting alerting back to legacy mode.

magikfly avatar Jun 08 '22 07:06 magikfly

I'm not using this chart, so apologies if this is a bit off-topic, but perhaps this helps someone in a similar situation. I hit the same issue with Grafana panicking:

grafana panic: New alert rules created while using unified alerting will be deleted, set force_migration=true in your grafana.ini and try again if this is okay.

That specific environment was just running grafana/grafana:latest with some automatic image pull and update in place. Looking at the current image, it contains Grafana v8.5.5. However, looking at the list of previously used docker images, I noticed that at some stage about 7 days ago grafana/grafana:latest (image id 21d6214505a0) contained Grafana v9.0.0-beta2:

> docker run -ti --rm  --entrypoint /usr/share/grafana/bin/grafana-server 21d6214505a0 -v
Version 9.0.0-beta2 (commit: 3ed722bb5c, branch: HEAD)

So to me it looks like, by blindly upgrading to the latest image, Grafana was at some stage updated to the latest beta (which, on the plus side, worked without issues) and then today downgraded to v8.5.5, leading to the panic. To resolve this I restored the Grafana DB from before this unintended upgrade, and I can run the latest image again without problems.

As a side note, I'm aware that running latest is generally not best practice; in this environment I'm not super concerned about availability or the data, just sharing this in the hope it helps someone in a similar situation.
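For anyone in the same spot, a minimal sketch of pinning an explicit tag instead of :latest (the container and volume names here are placeholders), so an automatic pull can no longer move the database schema forward or backward behind your back:

> docker run -d --name grafana -v grafana-data:/var/lib/grafana grafana/grafana:8.5.5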

mprasil avatar Jun 08 '22 09:06 mprasil

Very interesting observation. However, I think this is not related to the issue seen here. I doubt we were using the latest tag; versions in this helm chart are always pinned (e.g. prom-stack 0.56.3 uses Grafana 8.5.3).

mindrunner avatar Jun 09 '22 13:06 mindrunner

Yeah, I would assume so, and I apologize for hijacking this issue; however, this issue is pretty much the only remotely relevant thing I found when searching for that specific error message.

mprasil avatar Jun 09 '22 16:06 mprasil

Got the same error in Grafana 8.5.3 as well. The error says to "set force_migration=true in your grafana.ini".

  • Could you please let us know how to avoid this issue, i.e. how to identify which alert rule is causing it and how to fix it?
  • As mentioned in the Grafana documentation, setting force_migration to true will run migrations that might cause data loss. Does setting force_migration to true remove only the alert rules that have the issue, or will it remove all alert rules?

sriharshabm avatar Aug 17 '22 12:08 sriharshabm

Once the Grafana pod goes into this state, it does not recover, so this is a blocker. Could you please let us know of any W/A? Will setting force_migration=true in grafana.ini solve the issue? If not, please suggest some other W/A. If yes, what data will be lost?
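One hedged precaution before flipping that flag, assuming the default embedded SQLite backend and a pod that is still reachable (pod and namespace names below are placeholders): copy the database out first, so anything the rollback deletes can be restored.

> kubectl cp <namespace>/<grafana-pod>:/var/lib/grafana/grafana.db ./grafana.db.bak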

sriharshabm avatar Aug 22 '22 17:08 sriharshabm

We are still facing this issue. Are there any solutions to this?

slashr avatar Sep 29 '22 11:09 slashr

The issue seems related to switching from Unified Alerting back to Legacy Alerting. The Grafana docs say to set force_migration = true to revert back to Legacy Alerting, which will restore your alerts to what they were at the time the update took place.

But there's an issue with the helm charts you could run into next.
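A minimal values sketch of that rollback path, assuming the chart renders these keys into grafana.ini as-is (section and option names taken from the Grafana 8.x configuration docs):

  grafana.ini:
    force_migration: true
    alerting:
      enabled: true
    unified_alerting:
      enabled: false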

levi-pole avatar Oct 04 '22 18:10 levi-pole

Had the same issue. I enabled force_migration using extraEnvVars (GF_DEFAULT_FORCE_MIGRATION) and lost all alerts except two that came with a new dashboard I had imported a couple of days back. I removed the two legacy-type alerts and re-enabled unified alerting. All is working now.
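For reference, a hedged sketch of that environment-variable route; the exact values key depends on the chart flavour (some charts take an extraEnvVars list as below, the upstream grafana chart exposes an env: map instead). GF_DEFAULT_FORCE_MIGRATION maps to the root-level force_migration option in grafana.ini:

  extraEnvVars:
    - name: GF_DEFAULT_FORCE_MIGRATION
      value: "true"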

bogdanro avatar Nov 15 '22 20:11 bogdanro

My Grafana version is 8.5.15 and I got the same error. As shown in the logs here:

lvl=eror msg="Critical error" reason="Grafana has already been migrated to Unified Alerting.\nAny alert rules created while using Unified Alerting will be deleted by rolling back.\n\nSet force_migration=true in your grafana.ini and restart Grafana to roll back and delete Unified Alerting configuration data.

Followed the same approach by updating the chart with the values below, and it worked ✅

  grafana.ini:
    force_migration: true
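
For completeness, a hedged sketch of applying that change so the migration runs on the next start (release and namespace names here are placeholders):

> helm upgrade <release> grafana/grafana -n <namespace> -f values.yaml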

decipher27 avatar Apr 13 '23 06:04 decipher27