Auth issues installing Oncall plugin
What went wrong?
What happened:
- I've configured the plugin to connect to the engine. Everything appears to be OK, and at least the UI indicates that the plugin was installed. However, there are a couple of entries like this in the engine logs:
source=engine:app google_trace_id=none logger=apps.grafana_plugin.helpers.client Error connecting to api instance Expecting value: line 1 column 1 (char 0)
Now, after "Open Grafana OnCall" is clicked, the page shows "OnCall was not able to load the current user. Try refreshing the page". Calls to api/plugin-proxy/grafana-oncall-app/api/internal/v1/ fail with "Non-existent or anonymous user."
The engine logs read:
source=engine:app google_trace_id=none logger=apps.auth_token.auth Could not get user from grafana request. Context {'UserID': 1, 'OrgID': 1, 'OrgName': 'Main Org.', 'OrgRole': 'Admin', 'ExternalAuthModule': '', 'ExternalAuthID': '', 'Login': 'admin', 'Name': '', 'Email': 'admin@localhost', 'ApiKeyID': 0, 'IsServiceAccount': False, 'OrgCount': 1, 'IsGrafanaAdmin': True, 'IsAnonymous': False, 'IsDisabled': False, 'HelpFlags1': 0, 'LastSeenAt': '2023-09-15T11:00:49Z', 'Teams': [], 'Analytics': {'Identifier': 'admin@localhost@https://grafana.develop.argyle.io/', 'IntercomIdentifier': ''}}
source=engine:app google_trace_id=none logger=root inbound latency=0.033159 status=401 method=OPTIONS path=/api/internal/v1/escalation_policies user_agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.15 content-length=0 slow=0
source=engine:app google_trace_id=none logger=django.request Unauthorized: /api/internal/v1/escalation_policies
source=engine:uwsgi status=401 method=OPTIONS path=/api/internal/v1/escalation_policies latency=0.034497 google_trace_id=- protocol=HTTP/1.1 resp_size=284 req_body_size=0
What did you expect to happen:
- Plugin is installed and actually opens
I've tried this with both the default admin account and a Google-managed account that has been given admin permissions, with the same results.
How do we reproduce it?
I installed the helm chart using the following value config:
grafana:
  enabled: false
externalGrafana:
  url: http://k8s-grafana
ingress-nginx:
  enabled: false
ingress:
  enabled: false
cert-manager:
  enabled: false
mariadb:
  enabled: false
postgresql:
  enabled: true
database:
  type: postgresql
The same cluster is running Grafana version 10.0.1 (5a30620b85)
Grafana OnCall Version
v1.3.37
Product Area
Auth
Grafana OnCall Platform?
Kubernetes
User's Browser?
No response
Anything else to add?
No response
After further investigation, it seems that on install the OnCall plugin fails to read org and user data, and then all further calls fail because it has no user data to work with. This is just a theory, I guess, and it still doesn't explain why the Grafana API returns an empty response on the api/org, api/org/users/, and api/teams/search calls.
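For what it's worth, the "Expecting value: line 1 column 1 (char 0)" text in the engine log is the message Python's JSON decoder produces when the body it is given does not start with valid JSON, for example an empty body or an HTML page. Below is a minimal sketch that reproduces the message; `parse_api_body` is a hypothetical helper for illustration and is not part of OnCall.

```python
import json


def parse_api_body(body: str):
    """Return (data, error): parsed JSON, or the decoder's message on failure."""
    try:
        return json.loads(body), None
    except json.JSONDecodeError as exc:
        return None, str(exc)


# An empty body and an HTML body both fail at the very first character.
empty_data, empty_error = parse_api_body("")
html_data, html_error = parse_api_body("<html>login</html>")

# A JSON body parses normally (sample values taken from the log context above).
ok_data, ok_error = parse_api_body('{"id": 1, "name": "Main Org."}')

print(empty_error)
```

So the log entry by itself only says that the reply was not JSON. It does not say why, which is consistent with the engine reaching something other than the Grafana API it expected.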
Do we have any fix for this ?
Same issue, using Grafana 9.5.3 with oncall plugin+engine version 1.3.43. Seeing a mix of 401 and 403 in the engine logs.
2023-10-12 12:23:57 source=engine:app google_trace_id=none logger=django.request Forbidden: /api/internal/v1/alert_receive_channels/integration_options
2023-10-12 12:23:57 source=engine:uwsgi status=403 method=GET path=/api/internal/v1/alert_receive_channels/integration_options latency=0.012331 google_trace_id=- protocol=HTTP/1.1 resp_size=249 req_body_size=0
2023-10-12 12:23:57 source=engine:app google_trace_id=none logger=apps.auth_token.auth Could not get user from grafana request. Context {'UserID': 1, 'OrgID': 1, 'OrgName': 'Main Org.', 'OrgRole': 'Admin', 'ExternalAuthModule': '', 'ExternalAuthID': '', 'Login': 'admin', 'Name': 'Admin', 'Email': 'admin@localhost', 'ApiKeyID': 0, 'IsServiceAccount': False, 'OrgCount': 1, 'IsGrafanaAdmin': True, 'IsAnonymous': False, 'IsDisabled': False, 'HelpFlags1': 0, 'LastSeenAt': '2023-10-12T12:23:11Z', 'Teams': [], 'Analytics': {'Identifier': 'admin@localhost@http://localhost/grafana/', 'IntercomIdentifier': ''}}
I had exactly the same issue, but it only happened in my develop env. I'd assume that if you check the celery logs in real time you might see this:
Max retries exceeded with url: /api/org (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1002)')))
and the same for other Grafana APIs. This happened because I use self-signed certificates in the develop env, so celery failed. I did not find any way to skip SSL verification through the helm chart, so I just pointed it at Grafana's HTTP port (80) locally inside the k8s cluster, and everything started to work.
Thanks, you pointed me in the right direction! It wasn't an SSL error in my case, but when I dug into the celery logs I found DNS errors, which led me to an FQDN misconfiguration (my fault) of the GRAFANA_API_URL. After correcting that and reconnecting the plugin, the auth issue is gone.
So I guess there can be a number of different root causes for this, I was just not looking in the right place for clues (or understanding how all pieces fit together) :)
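Since both root causes reported so far (certificate verification and DNS) show up as the same auth error in the UI, here is a rough sketch of a probe that could be run from inside the engine or celery container to tell them apart. It uses only the Python standard library rather than whatever HTTP client the engine uses, and `classify` and `probe` are hypothetical names, not OnCall functions.

```python
import socket
import ssl
import urllib.error
import urllib.request


def classify(exc: Exception) -> str:
    """Map a connection failure to a coarse category."""
    # HTTPError means the server answered, so DNS and TLS both worked.
    if isinstance(exc, urllib.error.HTTPError):
        return f"http {exc.code}"
    # URLError wraps the underlying socket/ssl error in .reason.
    reason = exc.reason if isinstance(exc, urllib.error.URLError) else exc
    if isinstance(reason, ssl.SSLError):
        return "tls"
    if isinstance(reason, socket.gaierror):
        return "dns"
    return "connection"


def probe(url: str, timeout: float = 5.0) -> str:
    """Try to fetch url and report 'http <status>', 'tls', 'dns' or 'connection'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"http {resp.status}"
    except OSError as exc:  # URLError and socket errors are OSError subclasses
        return classify(exc)
```

Calling `probe()` with the configured GRAFANA_API_URL plus `/api/org` should return some `http ...` result if the name resolves and the connection can be established. A `dns` or `tls` result would match the two cases described above.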
What GRAFANA_API_URL did you use?
@Polinth I think I have the same issue, but how did you configure this? Did you just change the GRAFANA_API_URL in the .env, or something else? If it's the former, it sadly still doesn't work for me, even though the name does resolve when I run nslookup in the containers. My filtered logs:
celery_1 | 2024-02-01 19:12:31,372 source=engine:celery worker=ForkPoolWorker-2 task_id=cb61aaf3-6039-427e-a563-a86e7fbf4fb9 task_name=apps.grafana_plugin.tasks.sync.plugin_sync_organization_async name=apps.grafana_plugin.helpers.client level=WARNING Error connecting to api instance HTTPConnectionPool(host='grafana', port=3000): Max retries exceeded with url: /api/access-control/users/permissions/search?actionPrefix=grafana-oncall-app (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc0b09a25d0>: Failed to establish a new connection: [Errno -2] Name does not resolve'))
celery_1 | 2024-02-01 19:12:31,374 source=engine:celery worker=ForkPoolWorker-2 task_id=cb61aaf3-6039-427e-a563-a86e7fbf4fb9 task_name=apps.grafana_plugin.tasks.sync.plugin_sync_organization_async name=apps.grafana_plugin.helpers.client level=WARNING Error connecting to api instance HTTPConnectionPool(host='grafana', port=3000): Max retries exceeded with url: /api/org (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc0b09ba210>: Failed to establish a new connection: [Errno -2] Name does not resolve'))
engine_1 | 2024-02-01 19:12:31 source=engine:app google_trace_id=none logger=apps.grafana_plugin.helpers.client Error connecting to api instance HTTPConnectionPool(host='grafana', port=3000): Max retries exceeded with url: /api/access-control/users/permissions/search?actionPrefix=grafana-oncall-app (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fdecf0835d0>: Failed to establish a new connection: [Errno -2] Name does not resolve'))
engine_1 | 2024-02-01 19:12:31 source=engine:app google_trace_id=none logger=apps.grafana_plugin.helpers.client Error connecting to api instance HTTPConnectionPool(host='grafana', port=3000): Max retries exceeded with url: /api/org (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fdecf083ed0>: Failed to establish a new connection: [Errno -2] Name does not resolve'))
I was using the helm charts as a base for my deployment, and there were multiple instances in those definitions that looked like this:
- name: GRAFANA_API_URL
  value: "http://grafana.grafana-test.svc"
with the format being http://<service>.<namespace>.svc
That does indeed then get populated as env within the deployments, most notably the celery one where I found the errors. After I properly edited all of those entries and re-applied the yaml it started working for me.
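To make the format above concrete, here is a small sketch that builds the in-cluster URL from a service name and namespace and rejects values that are not lowercase DNS labels. `in_cluster_url` is a hypothetical helper, and the optional port and scheme arguments are my own additions for illustration.

```python
import re

# Lowercase alphanumerics and '-', starting and ending with an alphanumeric.
_DNS_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")


def in_cluster_url(service: str, namespace: str, port=None, scheme: str = "http") -> str:
    """Build <scheme>://<service>.<namespace>.svc[:port]."""
    for label in (service, namespace):
        if len(label) > 63 or not _DNS_LABEL.match(label):
            raise ValueError(f"not a valid DNS label: {label!r}")
    host = f"{service}.{namespace}.svc"
    return f"{scheme}://{host}" + (f":{port}" if port else "")


print(in_cluster_url("grafana", "grafana-test"))
```

With the values from my manifests this prints http://grafana.grafana-test.svc, which is what went into every GRAFANA_API_URL entry.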
I didn't move forward with any of this setup after my testing, so it's been a while since I touched it, but I hope I remember correctly.
closed in favour of https://github.com/grafana/oncall/issues/3772