compliantkubernetes-apps icon indicating copy to clipboard operation
compliantkubernetes-apps copied to clipboard

Added prometheus metrics to the diagnostics script

Open Elias-elastisys opened this issue 1 year ago • 8 comments

[!warning] This is a public repository, ensure not to disclose:

  • [x] personal data beyond what is necessary for interacting with this pull request, nor
  • [x] business confidential information, such as customer names.

What kind of PR is this?

Required: Mark one of the following that is applicable:

  • [x] kind/feature
  • [ ] kind/improvement
  • [ ] kind/deprecation
  • [ ] kind/documentation
  • [ ] kind/clean-up
  • [ ] kind/bug
  • [ ] kind/other

Optional: Mark one or more of the following that are applicable:

[!important] Breaking changes should be marked kind/admin-change or kind/dev-change depending on type Critical security fixes should be marked with kind/security

  • [ ] kind/admin-change
  • [ ] kind/dev-change
  • [ ] kind/security
  • [ ] kind/adr

What does this PR do / why do we need this PR?

Adds the possibility to query prometheus metrics using the diagnostics script.

  • Fixes https://github.com/elastisys/ck8s-issue-tracker/issues/285

Information to reviewers

Do we want more metrics?

Here is an example output
❯ ./bin/ck8s diagnostics sc --with-metrics-since 1day
[ck8s] Running diagnostics...
[ck8s] Validating sc config
[ck8s] Using unencrypted kubeconfig /home/elias/Documents/environments/coffee-brewer-dev/.state/kube_config_sc.yaml
------------------------------------------------------------------------------------------------------------------
Querying fluentd output error rate.
Fluentd output error rate over 0 on the dates:
ons 14 aug 2024 03:02:00 CEST
ons 14 aug 2024 08:44:00 CEST
ons 14 aug 2024 08:45:00 CEST

------------------------------------------------------------------------------------------------------------------
Querying dropped packages.
Found dropped packages going to pod: "fluentd-aggregator-0" on dates:
ons 14 aug 2024 08:44:00 CEST
ons 14 aug 2024 08:45:00 CEST

Found dropped packages going from pod: "fluentd-forwarder-j54f2" on dates:
ons 14 aug 2024 08:42:00 CEST
ons 14 aug 2024 08:43:00 CEST
ons 14 aug 2024 08:44:00 CEST
ons 14 aug 2024 08:45:00 CEST

Found dropped packages going from pod: "fluentd-forwarder-kh98r" on dates:
ons 14 aug 2024 08:42:00 CEST
ons 14 aug 2024 08:43:00 CEST
ons 14 aug 2024 08:44:00 CEST
ons 14 aug 2024 08:45:00 CEST

Found dropped packages going from pod: "fluentd-forwarder-qlbjq" on dates:
ons 14 aug 2024 08:42:00 CEST
ons 14 aug 2024 08:43:00 CEST
ons 14 aug 2024 08:44:00 CEST
ons 14 aug 2024 08:45:00 CEST

Found dropped packages going from pod: "fluentd-forwarder-svtsz" on dates:
ons 14 aug 2024 08:42:00 CEST
ons 14 aug 2024 08:43:00 CEST
ons 14 aug 2024 08:44:00 CEST
ons 14 aug 2024 08:45:00 CEST

Found dropped packages going from pod: "thanos-query-query-frontend-85f8d59458-fhthk" on dates:
tis 13 aug 2024 16:54:00 CEST
tis 13 aug 2024 16:55:00 CEST

Found dropped packages going to pod: "thanos-query-query-frontend-85f8d59458-fhthk" on dates:
tis 13 aug 2024 16:54:00 CEST
tis 13 aug 2024 16:55:00 CEST

Found dropped packages going from pod: "hnc-controller-webhook-57777c7778-8gxv7" on dates:
tis 13 aug 2024 16:42:00 CEST

------------------------------------------------------------------------------------------------------------------
Querying uptime.
The target "fluentd" was down on dates:
ons 14 aug 2024 08:44:00 CEST
ons 14 aug 2024 08:45:00 CEST

------------------------------------------------------------------------------------------------------------------
Querying Opensearch cluster status
Opensearch is in "yellow" state!

Checklist

  • [x] Proper commit message prefix on all commits
  • Change checks:
    • [x] The change is transparent
    • [ ] The change is disruptive
    • [x] The change requires no migration steps
    • [ ] The change requires migration steps
    • [ ] The change upgrades CRDs
    • [ ] The change updates the config and the schema
  • Metrics checks:
    • [ ] The metrics are still exposed and present in Grafana after the change
    • [ ] The metrics names didn't change (Grafana dashboards and Prometheus alerts are not affected)
    • [ ] The metrics names did change (Grafana dashboards and Prometheus alerts were fixed)
  • Logs checks:
    • [ ] The logs do not show any errors after the change
  • Pod Security Policy checks:
    • [ ] Any changed pod is covered by Pod Security Admission
    • [ ] Any changed pod is covered by Gatekeeper Pod Security Policies
    • [ ] The change does not cause any pods to be blocked by Pod Security Admission or Policies
  • Network Policy checks:
    • [ ] Any changed pod is covered by Network Policies
    • [ ] The change does not cause any dropped packets in the NetworkPolicy Dashboard
  • Audit checks:
    • [ ] The change does not cause any unnecessary Kubernetes audit events
    • [ ] The change requires changes to Kubernetes audit policy
  • Falco checks:
    • [ ] The change does not cause any alerts to be generated by Falco
  • Bug checks:
    • [ ] The bug fix is covered by regression tests

Elias-elastisys avatar Aug 14 '24 13:08 Elias-elastisys

I've not yet looked into the specifics, but this looks more and more like we are starting to export everything which we usually get access to by having access to the clusters and the dashboards themselves. It does not sounds feasible that we continue down this road as eventually we will make a complete data export of everything just to be able to debug customer clusters.

Could this be solved differently? VPN access? Customer OPS training? Ping @cristiklein @llarsson @elastisys/sme

simonklb avatar Aug 14 '24 14:08 simonklb

I've not yet looked into the specifics, but this looks more and more like we are starting to export everything which we usually get access to by having access to the clusters and the dashboards themselves. It does not sounds feasible that we continue down this road as eventually we will make a complete data export of everything just to be able to debug customer clusters.

Could this be solved differently? VPN access? Customer OPS training? Ping @cristiklein @llarsson @elastisys/sme

Thanks @simonklb. Actually, we discussed this in a meeting before summer, the meeting notes are linked in the issue (please let me know if you don't have access to it). Generally speaking, we aim to be able to support customers even those who are running air-gapped environments, and some of them might need to limit our access. That's why we are trying improve diagnostic script in such way.

salehsedghpour avatar Aug 15 '24 05:08 salehsedghpour

If yes, can you make sure that we have proper documentation in public docs and/or in usage script?

As it is now, it is always enabled if thanos is enabled. Thought it was such a small thing anyways. Should I change that?

Elias-elastisys avatar Aug 15 '24 06:08 Elias-elastisys

If yes, can you make sure that we have proper documentation in public docs and/or in usage script?

As it is now, it is always enabled if thanos is enabled. Thought it was such a small thing anyways. Should I change that?

I'm not sure about the impacts that it might have on our security promises. Do you have any input on this @cristiklein?

salehsedghpour avatar Aug 15 '24 08:08 salehsedghpour

I'm not sure about the impacts that it might have on our security promises. Do you have any input on this @cristiklein?

For some context, you need the ServiceAccount token in order to get access by the kubernetes API. Which means the person running the diagnostics scripts needs access to describe secrets in the thanos namespace in order to even get past kube api. So in my opinion I did not feel it opened up any new security issues.

Elias-elastisys avatar Aug 15 '24 09:08 Elias-elastisys

Right, I was mainly thinking about creating SA (I believe it's a quick procedure), and then get the data and then delete the created resources when calling diagnostic script. The main idea is to close everything unless you need it.

But I see your point, it might not have any drastic impact. But I would love to see what @OlleLarsson and @cristiklein think about it.

salehsedghpour avatar Aug 20 '24 11:08 salehsedghpour

But I see your point, it might not have any drastic impact. But I would love to see what @OlleLarsson and @cristiklein think about it.

See my comment on the line of code. I trust that the current solution has a good rationale behind it, however, having an intermediate token than the user retrieves kind of smells. Can we make the script work without this token, i.e., with the KUBECONFIG of the SC directly?

cristiklein avatar Aug 20 '24 14:08 cristiklein

Changed to separate commands. Changed token to oidc-login and removed the SA/rbac chart. PTAL

Elias-elastisys avatar Aug 27 '24 08:08 Elias-elastisys