
Prometheus metrics endpoint

Yannig opened this issue 4 years ago • 18 comments

The purpose of this PR is to set up a metrics endpoint for Prometheus. For now, the metrics collected are fairly limited:

  • Number of databases created (pg_new_db)
  • Database synchronization status between Kubernetes and the operator (pg_sync_status)

The goal is to make it easy to detect that the operator is no longer able to communicate with a PG cluster, for example because the Network Policies in place are a little too restrictive (any resemblance to actual events is entirely possible).

I'm aware that the number of metrics is relatively limited and that many more could be collected. For now, I would mainly like quick feedback on this feature.
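
As a minimal sketch, the endpoint could be wired up with prometheus/client_golang roughly as follows (the metric names pg_new_db and pg_sync_status come from the PR; the cluster label, help strings, and Serve helper are illustrative assumptions, not the PR's actual code):

package metrics

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// pg_new_db counts the databases created by the operator.
var NewDB = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "pg_new_db",
	Help: "Number of databases created by the operator.",
})

// pg_sync_status reports whether the last sync with a cluster succeeded.
var SyncStatus = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "pg_sync_status",
	Help: "Sync status between the operator and the PG cluster (1 = ok, 0 = failing).",
}, []string{"cluster"})

// Serve registers the metrics and exposes them, e.g. on ":8080",
// matching the containerPort used in the deployment below.
func Serve(addr string) error {
	prometheus.MustRegister(NewDB, SyncStatus)
	http.Handle("/metrics", promhttp.Handler())
	return http.ListenAndServe(addr, nil)
}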

Yannig avatar Jun 17 '21 14:06 Yannig

An image compiled with this change is available at: quay.io/yannig/postgres-operator:v1.6.3

Here are the PodMonitor and PrometheusRule I'm using to integrate the postgres operator with the Prometheus Operator.

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: postgres-operator
  namespace: operators
spec:
  namespaceSelector:
    matchNames:
    - operators
  podMetricsEndpoints:
  - port: "http"
  selector:
    matchLabels:
      app.kubernetes.io/name: postgres-operator
      app.kubernetes.io/instance: postgres-operator
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: postgres-operator
  namespace: operators
spec:
  groups:
  - name: postgres-operator.rules
    rules:
    - alert: PostgresOperatorDBSyncStatus
      for: 15m
      expr: pg_sync_status == 0
      annotations:
        summary: "Enable to communicate with postgres DB cluster"
        description: "Postgres operator is unable to communicate directly with PG cluster. Maybe a network policies is to restrictive."
      labels:
        severity: critical
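
One caveat with this rule: pg_sync_status == 0 can only fire while the operator is still exporting the metric. If the operator pod itself goes away, the time series disappears rather than dropping to zero, so it may be worth pairing this with an absent(pg_sync_status) alert (or an up == 0 alert on the PodMonitor target).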

You will also need to add the port definition to the operator's Deployment:

...
        image: quay.io/yannig/postgres-operator:v1.6.3
        imagePullPolicy: Always
        name: postgres-operator
        ports:
        - containerPort: 8080
          name: http
          protocol: TCP
        resources:
          limits:
            cpu: 500m
            memory: 500Mi
          requests:
            cpu: 100m
            memory: 250Mi
...
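
Once the port is exposed, the endpoint can be spot-checked by port-forwarding the operator pod (kubectl port-forward on 8080) and running curl http://localhost:8080/metrics, which should list pg_new_db and pg_sync_status, assuming the PR serves metrics under the conventional /metrics path.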

Yannig avatar Jun 22 '21 13:06 Yannig

Sorry to insist, but are you interested in this feature or not? At the very least, do you have any feedback on what else should be done?

We have implemented this in our cluster, and it now allows us to know very easily whether we have a communication problem between our databases and the operator.

Yannig avatar Jun 28 '21 16:06 Yannig

@Yannig sorry to keep you waiting. I'm not sure we want to add a Prometheus dependency. I thought, people were usually solving this via sidecars.

FxKu avatar Jul 07 '21 09:07 FxKu

The sidecars are used to monitor Postgres itself. This PR is for monitoring the operator, and it gives visibility into operator events such as failed syncs.

mboutet avatar Jul 07 '21 12:07 mboutet

@Yannig Looks like there is a conflict in a file, care to take a look?

@FxKu Up for adding this?

MPV avatar Sep 16 '21 14:09 MPV

@MPV sure, I'll rebase onto the current branch.

Yannig avatar Sep 17 '21 06:09 Yannig

Rebase done against the latest version of the master branch.

Yannig avatar Sep 17 '21 06:09 Yannig

By the way, a version 1.7.0 with this patch is available via this image: quay.io/yannig/postgres-operator:v1.7.0

Yannig avatar Sep 17 '21 07:09 Yannig

Twice now I've discovered that my replicas weren't syncing only when I ran out of space on one of them; I'm sure I made some kind of mistake that led to the problem, but just having something that exported the "Lag in MB" would be a huge benefit, as it would allow me to set up monitoring and alerting for the problem.

taxilian avatar Oct 04 '21 21:10 taxilian

@taxilian Sure, it could be a good feature. Maybe I can try to implement it.
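
A rough sketch of what such a collector might look like, assuming the operator polls Patroni's /cluster REST endpoint (which lists cluster members along with their reported lag); the pg_replication_lag metric name and the scrapeLag helper are hypothetical, not part of this PR:

package metrics

import (
	"encoding/json"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
)

// pg_replication_lag exposes per-member lag as reported by Patroni.
var ReplicationLag = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "pg_replication_lag",
	Help: "Replication lag per cluster member as reported by Patroni.",
}, []string{"cluster", "member"})

type patroniCluster struct {
	Members []struct {
		Name string   `json:"name"`
		Role string   `json:"role"`
		Lag  *float64 `json:"lag"` // nil for the leader; decoding fails if Patroni reports "unknown"
	} `json:"members"`
}

// scrapeLag polls one Patroni endpoint (e.g. "http://<pod>:8008/cluster")
// and updates the gauge for every member that reports a numeric lag.
func scrapeLag(clusterName, url string) error {
	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	var c patroniCluster
	if err := json.NewDecoder(resp.Body).Decode(&c); err != nil {
		return err
	}
	for _, m := range c.Members {
		if m.Lag != nil {
			ReplicationLag.WithLabelValues(clusterName, m.Name).Set(*m.Lag)
		}
	}
	return nil
}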

Yannig avatar Oct 05 '21 08:10 Yannig

any news here?

HaveFun83 avatar Jan 03 '22 11:01 HaveFun83

@FxKu can we get this in, please? 🙏🏻 We need a way to monitor sync status from the operator's point of view.

Starefossen avatar Jan 24 '22 12:01 Starefossen

Any news here? There is already 1 approval. I would be glad to see the PR merged 😉

sebastiangaiser avatar Jun 02 '22 12:06 sebastiangaiser

Any chances to get this finally merged?

stephan2012 avatar Nov 08 '22 10:11 stephan2012

Hi! Thanks for contributing, and for a fair share of patience. The idea is good and welcome; everyone wants to monitor things. I might just want to challenge the new-database counter, which feels more like an example than real value. Let's maybe agree on what you did in terms of syncs, and expose success and failed sync counts, plus maybe a total count of databases observed by the operator at any given time?
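
If that direction is agreed on, the proposed metrics could look something like this (a sketch with prometheus/client_golang; all names are illustrative and nothing here was settled in the thread):

package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Incremented after each successful or failed sync, respectively;
	// would be registered alongside the existing metrics.
	SyncSuccess = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "pg_sync_success_total",
		Help: "Total number of successful cluster syncs.",
	})
	SyncFailed = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "pg_sync_failed_total",
		Help: "Total number of failed cluster syncs.",
	})
	// Set from the operator's cluster registry on each pass.
	DatabasesObserved = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "pg_databases_observed",
		Help: "Number of databases currently observed by the operator.",
	})
)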

Jan-M avatar Jan 16 '23 17:01 Jan-M

I don't think the OP necessarily anticipated that these would be the only metrics collected; rather, the submission provides a basic framework to start adding some.

Personally, the thing I'd want to see most is the "Lag in MB" from patronictl status; I have had one of the replicas stop syncing correctly a few times, and there is no real way to grab that value. Maybe sync status gives you that, but it sounds like it's about something else, and in any case it doesn't tell you specifically what is happening. I'd also want to know if a replica was simply not replicating fast enough, etc.

taxilian avatar Jan 16 '23 20:01 taxilian

Personally, the thing I'd want to see most is the "Lag in MB" from patronictl status; I have had one of the replicas stop syncing correctly a few times, and there is no real way to grab that value.

Take a look at https://github.com/gopaytech/patroni_exporter. It could be deployed with the Zalando operator as a single sidecar, or combined with another one (for example, a custom postgres/patroni exporter image).
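
(In the Zalando operator, the natural attachment point for such an exporter would be the sidecars field of the postgresql manifest, or the operator's global sidecar configuration, so the exporter runs next to each Postgres pod and can scrape the local Patroni REST API.)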

jurim76 avatar Feb 05 '23 13:02 jurim76