postgres-operator

add monitoring

theRealWardo opened this issue 7 years ago • 41 comments

what does zalando do for postgres monitoring of databases run via this operator?

I was thinking of building https://github.com/wrouesnel/postgres_exporter into the database container and having that be monitored via our prometheus operator.

are there any existing plans to add monitoring directly into this project in some way? if not, is there a need for a more detailed discussion/approach prior to contributing, or shall I do as the contribution guidelines say and just hack away and send a PR?

theRealWardo avatar Mar 07 '18 10:03 theRealWardo

Quick answer is no, there is no intent to make the operator "monitor" anything. Ideally the operator focuses on "operation", more specifically on the provisioning and modifying part. The "ops" part we largely leave to Patroni, which is very well suited to taking care of the cluster itself.

The operator however does contain a very slim API to allow monitoring it from the outside.

At Zalando we use ZMON (zmon.io) for all monitoring. But there are other options here, like Prometheus.

We are running Postgres with the bg_mon extension, which exposes a lot of Postgres data via a REST API on port 8080, so this helps a lot I think.
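For anyone who wants a quick look at that data, a rough sketch (the pod name is hypothetical, and it assumes bg_mon is listening on its default port 8080 inside the Spilo container):

```bash
# sketch: forward bg_mon's port from a (hypothetical) cluster pod and peek at its output
kubectl port-forward pod/acid-minimal-cluster-0 8080:8080 &
curl -s http://localhost:8080/ | head -c 500
```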

Jan-M avatar Mar 07 '18 10:03 Jan-M

thanks for the quick reply! to be clear, I'm not proposing monitoring the operator itself but rather the databases it is operating on. if there is something in the operator that you monitor and feel others should monitor, please do let me know! otherwise our system will probably just be monitoring that the pod is up and running.

what I'd like to add to this operator to facilitate that is a flag that would add a simple named monitoring port to the ServiceSpec. that would enable me to have a ServiceMonitor (custom resource) which my Prometheus Operator would then be able to turn into scrape targets for my Prometheus instance. does that sound reasonable?

theRealWardo avatar Mar 07 '18 10:03 theRealWardo

I forgot one tool here, just teasing it as we have not released it yet, but teams rely on our new pgview web interface to monitor their DBs too and it has proven very useful.

[screenshot: pgview web interface]

Jan-M avatar Mar 07 '18 10:03 Jan-M

for that kind of web dashboard thing we've been running https://github.com/ankane/pghero, which has definitely helped us a couple of times, but it doesn't hook into our alerting systems, which is what I'm really trying to achieve here.

theRealWardo avatar Mar 07 '18 10:03 theRealWardo

Operator monitoring: We have not figured this out completely. One part here is definitely user experience: making sure the operator is quick to provision new clusters and apply changes triggered by the user. Other than that we more or less monitor that the pod is running, which is not that helpful or informative.

Database monitoring: We don't consider this a task of the operator, and the operator is not required once the database is "deployed", as Patroni does all the magic for high availability and failover; this makes the operator itself much smaller in scope and much less important.

To monitor clusters, as said above, both Postgres (via bg_mon) and Patroni have REST APIs that are easy to query.
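As a rough illustration (run from inside the pod or via a port-forward; Patroni's REST API listens on port 8008 by default):

```bash
# sketch: probe the Patroni REST API; /master returns 200 only on the current
# leader and /replica only on replicas, which makes both easy to alert on
curl -s http://localhost:8008/patroni
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8008/master
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8008/replica
```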

Jan-M avatar Mar 07 '18 10:03 Jan-M

I adapted the operator to deploy the postgres exporter as a sidecar container (instead of running it inside the Spilo container). With this we can get metrics into Prometheus. So the operator is not monitoring anything, it just helps with the deployment. What do you guys think?

stefanhipfel avatar Apr 12 '18 09:04 stefanhipfel

We had the discussion once about arbitrary sidecar definition support, but scratched this until the need arises. Feel free to PR this or frame it in an issue, as this could become anything from simple to very generic.

Maybe we can also go for a "prometheus" sidecar, similarly static to the Scalyr sidecar. Can you dump your sidecar definition here so we can have a look?

Jan-M avatar Apr 13 '18 13:04 Jan-M

I am closing this.

The sidecar feature that we currently use only for Scalyr, in a hard-coded way, may see some improvements and become more generic, and then also serve the purpose of adding e.g. the postgres exporter as a sidecar via the operator.

Jan-M avatar May 29 '18 14:05 Jan-M

how about we keep this open and I send you a PR? I'll try to get you one this week which will add a monitoring sidecar option, if you are okay with that.

theRealWardo avatar May 29 '18 22:05 theRealWardo

Sure, PRs or idea sketches are very welcome. Maybe you can outline your idea briefly, as we have some ongoing discussions internally on how sidecars should look: from toggled, hard-coded examples like Scalyr now to a very generic approach.

Jan-M avatar May 30 '18 08:05 Jan-M

@Jan-M would be great to see that discussion here in the Open Source project, so others can comment/join.

hjacobs avatar May 30 '18 08:05 hjacobs

sure! so if I were to bring up the most important things for adding monitoring to this project:

  • make it easy for some common use case(s)
  • make it clear how to add other monitoring solutions

I think we should start by focusing on 2 common use cases, documenting them, and changing the project's current language of "Monitoring of clusters is not in scope, for this good tools already exist from ZMON to Prometheus and more Postgres specific options":

  • your case of using ZMON + pg_view and friends seems like it can be achieved simply via a modified image, right? I think this case is supported in the current design. this is interesting because it doesn't require additional permissions and instead builds monitoring into the Postgres image itself. let's document how to do this one.
  • I think the common use case for a lot of us is a sidecar container. this would enable my goal of prometheus monitoring with something like the exporter I linked above or a telegraf container. I'd propose we start by extending the current sidecar support with a monitoring-specific sidecar that can be enabled. this will be trickier than the baked-in approach because most of these processes running in the sidecar will require a connection URL. I believe using the superuser here is a bad idea as it can impact Patroni failovers, correct? so using the correct user/permissions has to be figured out for this...

a bit more technical details of what I am proposing for monitoring side cars specifically:

  • no one wants to copy pasta a ton of config, so provide two options: configure monitoring sidecars on the operator or on the cluster.
  • the default should just work: simply setting monitoring_docker_image to whatever image should be run as a sidecar should be enough, assuming:
    • the image is passed the following environment variables: POSTGRES_USER, POSTGRES_PASSWORD (and it obviously is configured to use them correctly)
    • that POSTGRES_USER is granted the correct permissions
  • for those of us running the Prometheus Operator, we'll apply a specific label to make our ServiceMonitor pick up these pods

going to sketch some code and share it shortly to get a bit more specific and hopefully keep the discussion going. thoughts here though?
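To make the Prometheus Operator part a bit more concrete, here is a rough sketch of such a ServiceMonitor (all names and labels are hypothetical; it assumes the monitoring port is exposed on a Service the ServiceMonitor can select):

```yaml
# hypothetical ServiceMonitor: scrape a Service that exposes the exporter
# port under the name "exporter" and carries the label app: pg-exporter
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: postgres-exporter
  labels:
    release: prometheus   # whatever label your Prometheus instance selects on
spec:
  selector:
    matchLabels:
      app: pg-exporter
  endpoints:
    - port: exporter
      interval: 30s
```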

theRealWardo avatar May 31 '18 06:05 theRealWardo

Just a very quick remark: imho monitoring is still not in scope of the operator; that said, sidecars should be supported and are a good idea.

For me the essence is that the operator should itself not start to "monitor" metrics or become e.g. a metric gateway/proxy.

Jan-M avatar May 31 '18 08:05 Jan-M

Hi @theRealWardo,

I had some similar thoughts along the lines of supporting any sidecar, not necessarily monitoring ones (for instance, ours is doing log exporting, and others may do something like regular manual vacuuming, index rebuilds or backups, or even run 3rd-party applications that do something else, i.e. export the data somewhere). Most of them, in general, need access to PGDATA/logs, and many also need access to the database itself.

The set of parameters you came up with looks good to me. We could also pass the role name that should be defined inside the infrastructure roles, and the operator would perform the job of passing the role name and the password from there to the cluster. However, in some cases it might be necessary to connect as a superuser, whose password is per-cluster.

Another idea is to expose the unix socket inside the volume mount of github.com/zalando/spilo, so that other containers running in the same pod can connect over the unix socket as user postgres without a password.

In order to fully support this, we would also need something along the lines of pod_environment_configmap (custom environment variables injected into every pod) to be propagated to the sidecar, and also a similar option for passing a global secret object (as in many cases values like external API keys cannot be trusted to mere configmaps) to expose its secrets to each container as environment variables.

I am not sure about the labels. It is not possible to apply labels to individual containers within the pod; what we could do is apply a sidecar label with the name of the sidecar. However, it looks redundant to me, since one can always instruct monitoring to look for pods with the set of cluster_labels configured in the operator.

I'll look into your PR and will also do the global secrets when I have time.

alexeyklyukin avatar May 31 '18 09:05 alexeyklyukin

so I modified my PR to add generic sidecar support. it allows users to add as many sidecars as they like to each of the pods running their clusters. this is sufficient to meet our use cases, and could be used by your team in place of the current Scalyr specific stuff.

we are going to try and run 2 sidecar containers actually. we'll be running one that does log shipping via Filebeat and another that does monitoring via Postgres Exporter.

hopefully this PR will enable other interesting uses too.

theRealWardo avatar Jun 01 '18 21:06 theRealWardo

@theRealWardo how are you passing env vars like DATA_SOURCE_NAME to Postgres Exporter, given that the ones made available by the postgres operator are different (i.e. POSTGRES_*)? Or do you create another container based on the available postgres exporter image for inclusion as a sidecar?

pitabwire avatar Mar 03 '19 15:03 pitabwire

right @pitabwire - we use a sidecar, 2 of them actually. one that ships logs and one that does monitoring.

theRealWardo avatar Mar 04 '19 15:03 theRealWardo

@theRealWardo could you guide on this? I tried to pass in the environment variables, but for some reason they are not being picked up in the postgres exporter container, and I get the error below:

```
kubectl logs -n datastore -f tester-events-cluster-0 pg-exporter
time="2019-03-07T07:13:56Z" level=info msg="Established new database connection." source="postgres_exporter.go:1035"
time="2019-03-07T07:13:56Z" level=info msg="Error while closing non-pinging DB connection: <nil>" source="postgres_exporter.go:1041"
time="2019-03-07T07:13:56Z" level=info msg="Error opening connection to database (postgresql://:[email protected]:5432/postgres?sslmode=disable): pq: Could not detect default username. Please provide one explicitly" source="postgres_exporter.go:1070"
time="2019-03-07T07:13:56Z" level=info msg="Starting Server: :9187" source="postgres_exporter.go:1178"
```

my Dockerfile is shown below:

```dockerfile
FROM ubuntu:18.04 as builder

ENV PG_EXPORTER_VERSION=v0.4.7
RUN apt-get update && apt-get install -y curl \
    && curl -sL https://github.com/wrouesnel/postgres_exporter/releases/download/${PG_EXPORTER_VERSION}/postgres_exporter_${PG_EXPORTER_VERSION}_linux-amd64.tar.gz \
    | tar -xz

FROM scratch

ENV PG_EXPORTER_VERSION=v0.4.7
# note: POSTGRES_USER and POSTGRES_PASSWORD are expanded at image build time
# (while still empty), so DATA_SOURCE_NAME is baked in without credentials
ENV POSTGRES_USER=""
ENV POSTGRES_PASSWORD=""
ENV DATA_SOURCE_NAME="postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@127.0.0.1:5432/postgres?sslmode=disable"

COPY --from=builder /postgres_exporter_${PG_EXPORTER_VERSION}_linux-amd64/postgres_exporter /postgres_exporter

EXPOSE 9187

ENTRYPOINT [ "/postgres_exporter" ]
```

pitabwire avatar Mar 07 '19 18:03 pitabwire

I'm using a sidecar to run postgres_exporter. The config looks like this:

apiVersion: "acid.zalan.do/v1"
kind: postgresql
spec:
    ...
    sidecars:
    - name: "prometheus-postgres-exporter"
      image: "wrouesnel/postgres_exporter:v0.4.7"
      env:
        - name: "PG_EXPORTER_EXTEND_QUERY_PATH"
          value: "/etc/config.yaml"
        - name: "DATA_SOURCE_NAME"
          value: "postgresql://postgres_exporter:password@localhost:5432/postgres?sslmode=disable"
      ports:
        - name: http
          containerPort: 9187
          protocol: TCP
    ...
```

Unfortunately, the endpoints don't expose the sidecar's port (9187 in this case)

tritruong avatar Mar 08 '19 08:03 tritruong

@tritruong the challenge with doing it this way is that you have to do it for every cluster definition. I would like to do it globally and in an automated way, so that any new cluster definitions are automatically picked up by the Prometheus monitoring and alerting system.

pitabwire avatar Mar 09 '19 05:03 pitabwire

And don't put the password into env vars like this.

I am in general in favor of having a global generic sidecar definition for whatever you need.

For monitoring though, or other tooling, the K8s API gives you a nice way to discover the services and clusters you want to monitor, and one exporter or tool per cluster may not be the best idea anymore. But arguably this depends.

Jan-M avatar Mar 12 '19 12:03 Jan-M

@Jan-M Yes, I could mount a secret file. Is there any way to disable the default environment variables that are always passed to sidecars (POSTGRES_USER and POSTGRES_PASSWORD)? https://github.com/zalando/postgres-operator/blob/31e568157b336592debbb37f2c44c1ca1769c00d/docs/user.md#sidecar-support

tritruong avatar Mar 12 '19 16:03 tritruong

@tritruong Maybe using a trust configuration with role-mapping in pg_hba.conf could grant the exporter sidecar just the required read-only access, potentially even without password-based authentication?
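A minimal sketch of that idea (the role name is just an example, and with Spilo/Patroni the pg_hba rules are normally managed through the Patroni configuration rather than edited by hand):

```
# hypothetical pg_hba.conf rule: allow a dedicated, read-only exporter role
# to connect from inside the pod (localhost) without a password
host   all   postgres_exporter   127.0.0.1/32   trust
```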

And yes @Jan-M, I believe @tritruong does have a point. Giving every little sidecar containing just a piece of monitoring software full on admin rights to the database might not be desired :-)

frittentheke avatar Mar 17 '19 19:03 frittentheke

Unfortunately, the endpoints don't expose the sidecar's port (9187 in this case)

@tritruong I created a separate service for the exporter to work around that fact.

rporres avatar Mar 18 '19 14:03 rporres

If anyone is interested in monitoring Patroni itself, I've written a patroni-exporter for Prometheus that scrapes the Patroni API. Someone might find it useful :) https://github.com/Showmax/patroni-exporter

jtomsa avatar Apr 01 '19 09:04 jtomsa

Here is a complete example we use internally to enable the Prometheus exporter:

```yaml
---
apiVersion: acid.zalan.do/v1
kind: postgresql
metadata:
  name: postgres
spec:
  teamId: "myteam"
  numberOfInstances: 1
  enableMasterLoadBalancer: false
  volume:
    size: 200Mi
  users:
    user_database: ["superuser", "createdb"]
  databases:
    database: user_database
  postgresql:
    version: "11"

  sidecars:
    - name: "exporter"
      image: "wrouesnel/postgres_exporter"
      ports:
        - name: exporter
          containerPort: 9187
          protocol: TCP
      resources:
        limits:
          cpu: 500m
          memory: 256M
        requests:
          cpu: 100m
          memory: 200M
      env:
        - name: "DATA_SOURCE_URI"
          value: "postgres/database?sslmode=disable"
        - name: "DATA_SOURCE_USER"
          valueFrom:
            secretKeyRef:
              name: postgres.postgres.credentials
              key: username
        - name: "DATA_SOURCE_PASS"
          valueFrom:
            secretKeyRef:
              name: postgres.postgres.credentials
              key: password

---
apiVersion: v1
kind: Service
metadata:
  name: pg-exporter
  labels:
    app: pg-exporter
spec:
  ports:
    - name: postgres
      port: 5432
      targetPort: 5432
    - name: exporter
      port: 9187
      targetPort: exporter
  selector:
    application: spilo
    team: myteam
```

Yannig avatar Aug 02 '19 12:08 Yannig

I opted for baking postgres_exporter into a custom-built Spilo image and having the supervisord in the Spilo image start it up automatically. Then I tweaked the Prometheus job rules to add a custom scrape target that scrapes the postgres_exporter metrics on all application=spilo pods. It seems to work quite well and lets me configure monitoring as an operator-wide feature instead of having each cluster define this themselves.
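For reference, a rough sketch of such a scrape job (assuming the exporter listens on port 9187 in every Spilo pod and that the operator's default application=spilo pod label is in place):

```yaml
# sketch: scrape postgres_exporter on every pod labeled application=spilo
- job_name: postgres-exporter
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    # keep only pods created by the operator
    - source_labels: [__meta_kubernetes_pod_label_application]
      action: keep
      regex: spilo
    # point the scrape target at the exporter port instead of the discovered one
    - source_labels: [__meta_kubernetes_pod_ip]
      regex: (.*)
      target_label: __address__
      replacement: "$1:9187"
```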

tanordheim avatar Sep 25 '19 10:09 tanordheim

When we upgraded our Kubernetes cluster to 1.16 the postgres-operator (1.2.0, #674) was not able to find the existing StatefulSets anymore (because of the API changes between 1.15 and 1.16). This led to a situation where all postgres clusters were marked as SyncFailed.

Status:
  Postgres Cluster Status:  SyncFailed

I think it would be very helpful if the operator exposed a /metrics endpoint for Prometheus, which would make it possible to alert on such things. This is not an issue of the database cluster but of the operator, so monitoring the database does not expose this kind of issue.

ekeih avatar Dec 24 '19 09:12 ekeih

@theRealWardo there are two open PRs that, combined, should allow most monitoring / log-shipping use cases to be configured:

  • Fully speced sidecars: https://github.com/zalando/postgres-operator/pull/890
  • Additional Volumes: https://github.com/zalando/postgres-operator/pull/736 (i.e. to expose the PostgreSQL socket to postgres_exporter or some other tool)
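To give an idea of how the second one could be used, a sketch of the kind of manifest snippet it enables (based on that PR; exact field names may differ in the released version):

```yaml
# sketch: share the PostgreSQL unix socket directory with every container in
# the pod, so a sidecar like postgres_exporter can connect without TCP/password
spec:
  additionalVolumes:
    - name: socket-dir
      mountPath: /var/run/postgresql
      targetContainers:
        - all
      volumeSource:
        emptyDir: {}
```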

frittentheke avatar Apr 01 '20 14:04 frittentheke

awesome thanks @frittentheke!

theRealWardo avatar Apr 04 '20 01:04 theRealWardo