
Seg Fault in bg_mon module on RHEL 9.4

Open renedamyon opened this issue 1 year ago • 9 comments

We're seeing repeated segmentation faults in the bg_mon submodule when deploying Postgres 16 via postgres-operator into Kubernetes clusters that use RHEL 9.4 images as the node OS, which causes Postgres to get stuck in a recovery loop.

https://github.com/CyberDem0n/bg_mon

The issue occurred when upgrading our Postgres clusters from Postgres 15 (docker image ghcr.io/zalando/spilo-15:3.0-p1) to 16, and also with new installs.

Confirmed the issue doesn't occur on RHEL 9.4 using ghcr.io/zalando/spilo-15:3.0-p1, and it also doesn't seem to occur on RHEL 9.3 or 8.10.

Environment

  • ghcr.io/zalando/postgres-operator:v1.12.0,
  • Kubernetes, seen on kURL and Rancher clusters with nodes running RHEL 9.4. Tested using RHEL 9.4 AMIs on AWS and seen with several customer installs on the mentioned distributions. I assume we'll see the same thing on other distros as well.
  • ghcr.io/zalando/spilo-16:3.2-p3; also tested ghcr.io/zalando/spilo-16:3.3-p2 with the same effect.
  • Velero is running in the clusters but not automatically backing up. My suspicion was that it was somehow causing the issue, as the segfault seems to occur less often after removing the Velero pod annotations, but it still occurs occasionally.
  • SELinux was disabled on the host

Added a gist with the coredump stacktrace, our postgres-operator postgresql CR, and some sample logs from when the segfault first appears in pg_log. https://gist.github.com/renedamyon/6130ad4dd65edfbaeae6a43717f3adc2

Not really sure how to resolve this. I've attempted to isolate the configuration that triggers it: it always occurs on initial installs of our environments, but only most of the time when deploying multiple Postgres clusters in the same environment or a cluster with a cut-down configuration. It doesn't seem to be triggered by any one specific thing I've spotted so far, and it doesn't appear at a consistent point in the logs that would point to something specific Postgres was running as the trigger.

Is bg_mon an essential module for Spilo, and is there any impact from removing it from the shared_preload_libraries list?

2024-09-27 12:11:16.137 UTC,,,72,,66f6a0b9.48,7,,2024-09-27 12:10:33 UTC,,0,LOG,00000,"background worker ""bg_mon"" (PID 78) was terminated by signal 11: Segmentation fault",,,,,,,,,"","postmaster",,0

Thanks for any assistance, Rene

renedamyon avatar Oct 17 '24 13:10 renedamyon

Still hoping for a response on this. What is the impact to the cluster of removing the bg_mon module?

renedamyon avatar Nov 05 '24 14:11 renedamyon

I think I have a similar issue with container image ghcr.io/zalando/spilo-17:4.0-p2 on a Zalando PGO operated cluster.

It is quite reproducible by starting a new cluster without importing any data (see the sketch after the error output below) and just running:

psql

\l

until it fails with this error:

postgres=# \l
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
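
For context, a fresh cluster for this kind of reproduction can be as small as the operator's usual minimal example. The manifest below is an illustrative sketch only (cluster name, team, user, database, and volume size are placeholders, not the exact setup used here):

apiVersion: "acid.zalan.do/v1"
kind: postgresql
metadata:
  name: acid-minimal-cluster   # placeholder cluster name
spec:
  teamId: "acid"
  numberOfInstances: 1
  postgresql:
    version: "17"              # matches the spilo-17 image mentioned above
  volume:
    size: "1Gi"
  users:
    foo_user: []               # illustrative user with no extra role flags
  databases:
    foo: foo_user              # database "foo" owned by "foo_user"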

ErmakovDmitriy avatar Apr 17 '25 09:04 ErmakovDmitriy

I just hit bg_mon segfaulting consistently in a loop as well

postgres log: /home/postgres/pgdata/pgroot/pg_log/postgresql-6.csv:2025-04-19 02:03:03.466 UTC,,,52,,6802f4f1.34,8,,2025-04-19 00:57:21 UTC,,0,LOG,00000,"background worker ""bg_mon"" (PID 58) was terminated by signal 11: Segmentation fault",,,,,,,,,"","postmaster",,0

For me, the bg_mon segfault crash loop started around the time I added a second database to my cluster.

Temporary workaround

kill -SIGSTOP $(pgrep -f '.*bg_mon'). In other words, you can prevent the crash by just freezing the process...

For me this was a big relief, because bg_mon would otherwise crash a couple of seconds after starting, putting my Postgres master into a crash-recovery loop.

Context

I am running k8s with ghcr.io/zalando/spilo-17:4.0-p2, so spilo-17, but the issue started on spilo-15 or spilo-16; upgrading did not change anything for me.

This is my cluster manifest:

kind: "postgresql"
apiVersion: "acid.zalan.do/v1"

metadata:
  namespace: "directus"
  labels:
    team: acid
  name: directus-pg
spec:
  teamId: "acid"
  postgresql:
    version: "16"
    #parameters:
      # by default SHOW shared_preload_libraries returns:
      # bg_mon,pg_stat_statements,pgextwlist,pg_auth_mon,set_user,timescaledb,pg_cron,pg_stat_kcache
      # dropping bg_mon as it causes the segmentation fault...
      #shared_preload_libraries: "pg_stat_statements,pgextwlist,pg_auth_mon,set_user,timescaledb,pg_cron,pg_stat_kcache"
  numberOfInstances: 2
  enableMasterLoadBalancer: false
  enableReplicaLoadBalancer: false
  enableConnectionPooler: false
  enableReplicaConnectionPooler: false
  enableMasterPoolerLoadBalancer: false
  enableReplicaPoolerLoadBalancer: false
  #maintenanceWindows:
  volume:
    size: "50Gi"
    storageClass: "hcloud-volumes"
  users:
    directus: []
  databases:
    hdd: directus
  #allowedSourceRanges:
  # IP ranges to access your cluster go here
  preparedDatabases:
    kratos:
      defaultUsers: true
  resources:
    requests:
      cpu: 2000m
      memory: 2000Mi
    limits:
      cpu: 3000m
      memory: 3000Mi

I tried removing bg_mon from shared_preload_libraries, but I did not manage to get that to work. It seems to be a subtle but long-running issue: https://github.com/CyberDem0n/bg_mon/issues/19

I figured I'd post the issue here rather than in https://github.com/CyberDem0n/bg_mon/issues/68, because it might just be simpler to ditch a buggy bg_mon than to fix an old C library.

If you know a way to turn bg_mon off, that would be very much welcome.
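
A minimal sketch of what un-commenting that parameters block could look like, assuming the list matches what SHOW shared_preload_libraries reports in your image (everything except bg_mon has to be listed explicitly, since setting the parameter replaces Spilo's default rather than subtracting from it):

spec:
  postgresql:
    version: "16"
    parameters:
      # the image's default preload list, minus bg_mon
      shared_preload_libraries: "pg_stat_statements,pgextwlist,pg_auth_mon,set_user,timescaledb,pg_cron,pg_stat_kcache"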

The crash loop started around the time I added:

  preparedDatabases:
    kratos:
      defaultUsers: true

to my manifest; the database was created along with its users, though.

Davidiusdadi avatar Apr 20 '25 11:04 Davidiusdadi

Since this bug was a show stopper for me, I ended up migrating from the Zalando postgres-operator to https://cloudnative-pg.io/. Via SIGSTOP on bg_mon I could get the old Postgres master to "not crash loop" and pull the final pg_dumps. Easter weekend complete.

Davidiusdadi avatar Apr 20 '25 23:04 Davidiusdadi

I got hit with this today. I have two independent databases with identical configuration, but only one of them ended up in a database reboot loop.

I managed to restore the DB to a working state by changing the configuration as below:

spec:
  postgresql:
    version: "16"
    parameters:
      # Override shared_preload_libraries to remove bg_mon, which triggers the crashes
      shared_preload_libraries: pg_stat_statements,pgextwlist,pg_auth_mon,set_user,timescaledb,pg_cron,pg_stat_kcache

The important part is just removing bg_mon from shared_preload_libraries.

The configuration of shared_preload_libraries seems to work, thank you. Unfortunately, I tried but could not find a way to define this at the OperatorConfiguration level.

ErmakovDmitriy avatar Apr 22 '25 12:04 ErmakovDmitriy

This is also an issue on other distributions like CentOS Stream 10 and openSUSE 15 (maybe Debian 11 or 12, I didn't try). I think it'd be better to remove bg_mon from the Spilo images altogether.

bbedward avatar Jun 04 '25 14:06 bbedward

@CyberDem0n, if it helps your investigation, note that simply defining the list of shared_preload_libraries in the postgres-operator manifest solves it, even if the list is the same and it includes bg_mon... For us it crashes only with this combination: PG 17 + RHEL 9.4 on physical machines. It works fine in all other combinations: PG 16 + bare-metal RHEL 9.4, and PG 17 + VMware RHEL 9.4.

sciornei-dox avatar Nov 03 '25 11:11 sciornei-dox

Ran into this today. With https://github.com/CyberDem0n/bg_mon/pull/69 fixing this in bg_mon, can the version used in Spilo be updated to include it?

manderson23 avatar Dec 02 '25 14:12 manderson23