[Bug] 'deployment not ready' is reported and the Grafana instance hangs in 'Phase: reconciling' with the v4.4.1 operator

Open lihongbj opened this issue 3 years ago • 19 comments

Describe the bug: With the v4.4.1 operator, 'deployment not ready' is reported in the Grafana instance's events after the operator is installed and a Grafana CR instance is created.

Version v4.4.1

To Reproduce Not a full reproduction but this is how I found it:

  1. Installed operator v4.4.1 to a single namespace grafana in OCP operator hub.
  2. Created CR for an instance
  3. Checked the instance and found: Warning ProcessingError 105m (x4188 over 14h) GrafanaDashboard deployment not ready
  4. In the OCP web console, went to Networking > Route and clicked grafana-route; the Grafana UI failed to load.

Expected behavior: No errors are reported and all CRs come up.

Runtime (please complete the following information):

  • OS: Linux
  • Grafana Operator Version v4.4.1
  • Environment: OCP 4.10.3
  • Deployment type: Installed operator v4.4.1 to a single namespace grafana in OCP operator hub.

Additional context

$ oc get pod
NAME                                                   READY   STATUS    RESTARTS   AGE
grafana-deployment-6c6d7cc447-dx6kh                    1/1     Running   0          104m
grafana-operator-controller-manager-6444c498b5-dgzrx   2/2     Running   0          14h

$ oc get grafanas.integreatly.org
NAME              AGE
example-grafana   14h

$ oc describe grafanas.integreatly.org example-grafana
...
Status:                                                                                                                                                                                                                                    
   Message:                success                                                                                                                                                                                                          
   Phase:                  reconciling                                                                                                                                                                                                      
   Previous Service Name:  grafana-service
Events:
  Type     Reason           Age                    From              Message
  ----     ------           ----                   ----              -------
  Warning  ProcessingError  105m (x4188 over 14h)  GrafanaDashboard  deployment not ready

lihongbj avatar May 10 '22 04:05 lihongbj

Same issue here with Grafana Operator v4.4.0 and v4.4.1.

R-Studio avatar May 10 '22 07:05 R-Studio

Is there a way to work around this before we can get an official fix?

morningspace avatar May 10 '22 07:05 morningspace

@lihongbj have you had a chance to look at the operator logs?

pb82 avatar May 10 '22 11:05 pb82

@pb82 It looks like there's really nothing special in the operator logs. Is there any more detailed logging that we can enable? Here are the details:

operator: kube-rbac-proxy

I0509 13:36:30.062223       1 main.go:190] Valid token audiences: 
I0509 13:36:30.062389       1 main.go:262] Generating self signed cert as no cert is provided
I0509 13:36:30.828736       1 main.go:311] Starting TCP socket on 0.0.0.0:8443
I0509 13:36:30.829339       1 main.go:318] Listening securely on 0.0.0.0:8443

operator: manager

I0509 13:36:48.839704       1 request.go:655] Throttling request took 1.04093473s, request: GET:https://172.30.0.1:443/apis/monitoring.coreos.com/v1?timeout=32s
I0509 13:36:58.882687       1 request.go:655] Throttling request took 1.046610156s, request: GET:https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1?timeout=32s

grafana-deployment:

{"@level":"debug","@message":"datasource: registering query type handler","@timestamp":"2022-05-10T02:47:29.000309Z","queryType":"exponential_heatmap_bucket_data"}
{"@level":"debug","@message":"datasource: registering query type handler","@timestamp":"2022-05-10T02:47:29.000372Z","queryType":"linear_heatmap_bucket_data"}
{"@level":"debug","@message":"datasource: registering query type handler","@timestamp":"2022-05-10T02:47:29.000399Z","queryType":"random_walk"}
{"@level":"debug","@message":"datasource: registering query type handler","@timestamp":"2022-05-10T02:47:29.000411Z","queryType":"predictable_pulse"}
{"@level":"debug","@message":"datasource: registering query type handler","@timestamp":"2022-05-10T02:47:29.000421Z","queryType":"predictable_csv_wave"}
{"@level":"debug","@message":"datasource: registering query type handler","@timestamp":"2022-05-10T02:47:29.000435Z","queryType":"random_walk_table"}
{"@level":"debug","@message":"datasource: registering query type handler","@timestamp":"2022-05-10T02:47:29.000445Z","queryType":"slow_query"}
{"@level":"debug","@message":"datasource: registering query type handler","@timestamp":"2022-05-10T02:47:29.000454Z","queryType":"no_data_points"}
{"@level":"debug","@message":"datasource: registering query type handler","@timestamp":"2022-05-10T02:47:29.000470Z","queryType":"datapoints_outside_range"}
{"@level":"debug","@message":"datasource: registering query type handler","@timestamp":"2022-05-10T02:47:29.000481Z","queryType":"manual_entry"}
{"@level":"debug","@message":"datasource: registering query type handler","@timestamp":"2022-05-10T02:47:29.000495Z","queryType":"csv_metric_values"}
{"@level":"debug","@message":"datasource: registering query type handler","@timestamp":"2022-05-10T02:47:29.000506Z","queryType":"streaming_client"}
{"@level":"debug","@message":"datasource: registering query type handler","@timestamp":"2022-05-10T02:47:29.000515Z","queryType":"live"}
{"@level":"debug","@message":"datasource: registering query type handler","@timestamp":"2022-05-10T02:47:29.000530Z","queryType":"grafana_api"}
{"@level":"debug","@message":"datasource: registering query type handler","@timestamp":"2022-05-10T02:47:29.000561Z","queryType":"arrow"}
{"@level":"debug","@message":"datasource: registering query type handler","@timestamp":"2022-05-10T02:47:29.000572Z","queryType":"annotations"}
{"@level":"debug","@message":"datasource: registering query type handler","@timestamp":"2022-05-10T02:47:29.000581Z","queryType":"table_static"}
{"@level":"debug","@message":"datasource: registering query type handler","@timestamp":"2022-05-10T02:47:29.000594Z","queryType":"random_walk_with_error"}
{"@level":"debug","@message":"datasource: registering query type handler","@timestamp":"2022-05-10T02:47:29.000604Z","queryType":"server_error_500"}
{"@level":"debug","@message":"datasource: registering query type handler","@timestamp":"2022-05-10T02:47:29.000621Z","queryType":"logs"}
{"@level":"debug","@message":"datasource: registering query type handler","@timestamp":"2022-05-10T02:47:29.000663Z","queryType":"node_graph"}
{"@level":"debug","@message":"datasource: registering query type fallback handler","@timestamp":"2022-05-10T02:47:29.000682Z"}

morningspace avatar May 10 '22 14:05 morningspace

Is it possible to install an old version of grafana operator?

morningspace avatar May 10 '22 14:05 morningspace

Can we get some basic Kubernetes debugging? What is the issue in the deployment? Please share the deployment YAML that gets generated and the output of oc describe pod <grafana-deployment>.

Also please provide the grafana yaml that you are using where you are getting this issue.
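
A minimal sketch of commands that would gather this information, assuming the names seen earlier in this thread (namespace grafana, Deployment grafana-deployment) and assuming the Grafana pods carry an app=grafana label, which may not hold for every install. The KUBECTL and NS variables and the collect_grafana_debug function are hypothetical helpers, not part of the operator; set KUBECTL=kubectl outside OpenShift.

```shell
# Hypothetical helper variables: override them to match your cluster.
KUBECTL="${KUBECTL:-oc}"
NS="${NS:-grafana}"

collect_grafana_debug() {
  # Generated Deployment manifest together with its current status
  "$KUBECTL" get deployment grafana-deployment -n "$NS" -o yaml
  # Pod-level detail: container state, probes, recent events
  # (assumes the pods are labelled app=grafana)
  "$KUBECTL" describe pod -l app=grafana -n "$NS"
  # The Grafana custom resources as stored in the cluster
  "$KUBECTL" get grafanas.integreatly.org -n "$NS" -o yaml
}

# Usage (needs cluster access):
#   collect_grafana_debug > grafana-debug.txt
```

Redact secrets such as admin credentials from the output before posting it.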

nissessenap avatar May 13 '22 08:05 nissessenap

Just a quick update: it looks like I can ultimately deploy Grafana and create a dashboard successfully. It was probably just in the middle of the process. After a while, this is what I see on the Grafana CR:

Status:
  Message:                success
  Phase:                  reconciling
  Previous Service Name:  grafana-service
Events:                   <none>

morningspace avatar May 13 '22 10:05 morningspace

@lihongbj

morningspace avatar May 13 '22 10:05 morningspace

This issue hasn't been updated for a while, marking as stale, please respond within the next 7 days to remove this label

github-actions[bot] avatar Jun 12 '22 11:06 github-actions[bot]

Any news?

R-Studio avatar Jun 13 '22 07:06 R-Studio

This issue hasn't been updated for a while, marking as stale, please respond within the next 7 days to remove this label

github-actions[bot] avatar Jul 13 '22 08:07 github-actions[bot]

Any news?

R-Studio avatar Jul 14 '22 06:07 R-Studio

@R-Studio we just released version 4.5.0, please give that a try. If that doesn't help, please provide some debug information; my understanding from the others in this issue is that it's solved for them. As I wrote earlier, some basic k8s debugging is needed. If your deployment is pending, why is it pending? Is the ingress successfully created (assuming that you use ingress)?

nissessenap avatar Jul 14 '22 08:07 nissessenap

@NissesSenap thanks for your reply. The deployment of Grafana, Dashboards and Datasources (ingress...) are working, but the resource with kind "Grafana" is still hanging in the phase "reconciling". Maybe it is correct that the "Grafana" remains in phase "reconciling"?

kubectl describe pod grafana-deployment

Name:               grafana-deployment-8566cf9b9-86zkg
Namespace:          grafana
Priority:           0
PriorityClassName:  <none>
Node:               k8s-prod-w1/10.0.0.63
Start Time:         Sun, 17 Jul 2022 17:10:46 +0200
Labels:             app=grafana
                    app.kubernetes.io/instance=grafana-operator
                    app.kubernetes.io/managed-by=Helm
                    app.kubernetes.io/name=grafana-operator
                    helm.sh/chart=grafana-operator-2.6.5
                    pod-template-hash=8566cf9b9
Annotations:        cni.projectcalico.org/containerID=2fbd91affd31cfa001c71d331bece1038e02c50f522be1eb60e8e2932237c44a
                    cni.projectcalico.org/podIP=198.18.1.243/32
                    cni.projectcalico.org/podIPs=198.18.1.243/32
                    kubernetes.io/psp=unrestricted-psp
                    prometheus.io/port=3000
                    prometheus.io/scrape=true
Status:             Running
IP:                 198.18.1.243
Controlled By:      ReplicaSet/grafana-deployment-8566cf9b9
Init Containers:
  grafana-plugins-init:
    Container ID:   docker://a142be62c9bc299d635433649c728b629e664eb5e4932743cfa4c99574091d3d
    Image:          docker.io/bitnami/grafana:8.5.6-debian-11-r8
    Image ID:       docker-pullable://bitnami/grafana@sha256:9d790f1b5586943f0a409b1630c1723035b07a76578898a9346ac579c5595ef4
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sun, 17 Jul 2022 17:13:08 +0200
      Finished:     Sun, 17 Jul 2022 17:13:08 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  512Mi
    Requests:
      cpu:     250m
      memory:  128Mi
    Environment:
      GRAFANA_PLUGINS:
    Mounts:
      /opt/plugins from grafana-plugins (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9gn8b (ro)
Containers:
  grafana:
    Container ID:  docker://ef3fbc85aa5120c40701088e69dfcd1a2fa078d1b43c2d9f1d27e12e2f45e24c
    Image:         docker.io/bitnami/grafana:8.5.6-debian-11-r8
    Image ID:      docker-pullable://bitnami/grafana@sha256:9d790f1b5586943f0a409b1630c1723035b07a76578898a9346ac579c5595ef4
    Port:          3000/TCP
    Host Port:     0/TCP
    Args:
      -config=/etc/grafana/grafana.ini
    State:          Running
      Started:      Sun, 17 Jul 2022 17:13:09 +0200
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1536Mi
    Requests:
      cpu:      500m
      memory:   500Mi
    Liveness:   http-get http://:3000/api/health delay=120s timeout=5s period=1s #success=1 #failure=6
    Readiness:  http-get http://:3000/api/health delay=30s timeout=5s period=1s #success=1 #failure=6
    Environment Variables from:
      grafana-secrets  Secret  Optional: false
    Environment:
      LAST_CONFIG:       38088a668a113410dd9fa57621301724b292da062f5ea51358b9e9d35e825977
      LAST_DATASOURCES:  fba06f372ca63acfc64d883042ee60f6947b7142b9cfd88ea307d0dfe1d5dfdf
    Mounts:
      /etc/grafana-configmaps/ldap-config from configmap-ldap-config (rw)
      /etc/grafana/ from grafana-config (rw)
      /etc/grafana/provisioning/dashboards from grafana-provision-dashboards (rw)
      /etc/grafana/provisioning/datasources from grafana-datasources (rw)
      /etc/grafana/provisioning/notifiers from grafana-provision-notifiers (rw)
      /etc/grafana/provisioning/plugins from grafana-provision-plugins (rw)
      /var/lib/grafana from grafana-data (rw)
      /var/lib/grafana/plugins from grafana-plugins (rw)
      /var/log/grafana from grafana-logs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9gn8b (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             True
  ContainersReady   True
  PodScheduled      True
Volumes:
  grafana-provision-plugins:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  grafana-provision-dashboards:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  grafana-provision-notifiers:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  grafana-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      grafana-config
    Optional:  false
  grafana-logs:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  grafana-data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  grafana-pvc
    ReadOnly:   false
  grafana-plugins:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  grafana-datasources:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      grafana-datasources
    Optional:  false
  configmap-ldap-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      ldap-config
    Optional:  false
  kube-api-access-9gn8b:
  <unknown>
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s

kubectl describe grafana -n grafana

Status:
  Message:                success
  Phase:                  reconciling
  Previous Service Name:  grafana-service
Events:                   <none>

R-Studio avatar Jul 18 '22 12:07 R-Studio

@R-Studio so everything is working for you except that it gives the wrong status? Since the deployment is ready, the health checks pass, so the Grafana instance should be okay.

Can you share your Grafana YAML? What happens if you make the config super basic? Does it pass then?

Do you have the same log entries as shown earlier in this issue? If not, please share the logs.

Can you try to uninstall everything, and especially remove the CRDs, and then install it all again?

nissessenap avatar Jul 18 '22 18:07 nissessenap

I have a similar issue.

Operator version: v4.5.1
Deployed on OpenShift
Phase: reconciling

I was able to create a data source CR, but only that one. For dashboards, even a simple one from the operator examples returns "deployment not ready". Notifications were also tested and are not working. There is no event message, and the logs are the same as @morningspace's. Unfortunately I can't redeploy the CRDs at this time. The Grafana CR config is basic:

config:
  auth:
    disable_login_form: false
    disable_signout_menu: true
  auth.anonymous:
    enabled: false
  security:
    admin_password: xxx
    admin_user: xxx

marcinsztalmi avatar Aug 18 '22 11:08 marcinsztalmi

My bad. I did not notice that setting dashboardLabelSelector is mandatory. After setting it, dashboards and notifications are loaded successfully.
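
For reference, a minimal sketch of what this looks like: a Grafana CR with dashboardLabelSelector set, plus a dashboard carrying the matching label. It assumes the integreatly.org/v1alpha1 API used by the v4 operator. The resource names, file name, label value, and dashboard JSON are placeholders, not taken from the configuration above.

```shell
# Write an illustrative manifest to a local file (all names are placeholders).
cat > grafana-selector-example.yaml <<'EOF'
apiVersion: integreatly.org/v1alpha1
kind: Grafana
metadata:
  name: example-grafana
spec:
  config:
    auth:
      disable_login_form: false
  # Only dashboards whose labels match this selector are picked up.
  dashboardLabelSelector:
    - matchExpressions:
        - {key: app, operator: In, values: [grafana]}
---
apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: example-dashboard
  labels:
    app: grafana   # must satisfy the selector above
spec:
  json: |
    {"title": "Example", "panels": []}
EOF

# Apply it (needs cluster access), for example:
#   oc apply -n grafana -f grafana-selector-example.yaml
```

If the dashboard's labels do not satisfy the selector, the operator would not be expected to import it.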

marcinsztalmi avatar Aug 19 '22 05:08 marcinsztalmi

This issue hasn't been updated for a while, marking as stale, please respond within the next 7 days to remove this label

github-actions[bot] avatar Sep 18 '22 06:09 github-actions[bot]

@marcinsztalmi I have already configured the dashboardLabelSelector, but it makes no difference (Grafana is still in state Reconciling). This is what my 'dashboardLabelSelector' configuration looks like. (Because I use the Bitnami Helm Chart, I have to use 'dashboardLabelSelectors' instead of 'dashboardLabelSelector'.)

grafana:
  dashboardLabelSelectors:
    - matchExpressions:
        - {key: app, operator: In, values: [grafana]}

R-Studio avatar Sep 19 '22 06:09 R-Studio

@R-Studio are you seeing this issue still with the latest version?

pb82 avatar Oct 04 '22 12:10 pb82

@pb82 I have just updated it, but still in state Reconciling. But I think everything else works.

R-Studio avatar Oct 10 '22 06:10 R-Studio

We are seeing this on Grafana Operator v4.4.1 in OpenShift.

It seems that in integreatly.org | v1alpha1 the API schema for GrafanaDashboard has the "labels" object, while the API schemas for GrafanaDataSource and GrafanaNotificationChannel do not.

Deploying a notification channel with the label app=grafana results in the object going straight into reconciling.

Redeploying it without the label app=grafana is successful.

We need the app=grafana label, as that is the match label our operator uses to find the monitoring objects across multiple namespaces.

I don't see why this couldn't be resolved by simply adding the labels object to the API schema of all the Grafana objects.

gr-ahh avatar Oct 19 '22 20:10 gr-ahh

Is anyone from Grafana looking into this?

This seems to be a substantial issue, considering that any GrafanaDashboard using labels (which are required for the artifact to be picked up by the operator) ends up with the artifact in reconciling status.

gr-ahh avatar Oct 31 '22 19:10 gr-ahh

@gr-ahh this project isn't supported by Grafana. It's a project maintained and developed mostly, but not exclusively, by Red Hat employees. We work on this project in our spare time and it's not part of our paid job.

To answer your question: no, we are currently not working on it. We haven't gotten any instructions on how to reproduce this issue, and this specific issue contains problems from multiple users, most of which have been solved and were due to configuration errors.

If you have exact steps for reproducing your issue, please share them and, depending on whether it's related to this issue or not, create a new one. It would probably be better to create a new one, since this one contains multiple issues as noted earlier.

nissessenap avatar Oct 31 '22 21:10 nissessenap

Since this issue contains multiple issues and I don't know which one is which any more, I will close this issue.

Please feel free to create a new issue with reproducible steps. And please update to the latest version of the operator, otherwise it will be very hard for us to debug any issues.

I will lock the comments on this issue.

nissessenap avatar Oct 31 '22 21:10 nissessenap