allow more than 1 PgBouncer replicas

Open low-on-mana opened this issue 2 years ago • 5 comments

Checks

Chart Version

latest

Kubernetes Version

NA

Helm Version

NA

Description

We are using the latest version of this chart in production for Airflow 2.3.0 (we migrated a few days ago).

One of the issues we faced is related to PgBouncer. Kubernetes rescheduled the pgbouncer pod to another node, and since only 1 pod was running, we had one task failure, which we had to retry manually later.

We could set safe_to_evict to false or add a pod disruption budget as a workaround, but the best solution would be to make PgBouncer HA by running multiple pods.

Can we have 2 pods for an HA PgBouncer?

spec:
  replicas: 1
  strategy:
    rollingUpdate:
      ## multiple pgbouncer pods can safely run concurrently

https://github.com/airflow-helm/charts/blob/420eae29c454f6e7e6a7837706ca2e6c0fe792b8/charts/airflow/templates/pgbouncer/pgbouncer-deployment.yaml#L24

Relevant Logs

No response

Custom Helm Values

No response

low-on-mana avatar Jul 04 '22 11:07 low-on-mana

@low-on-mana is this really safe to use in an Airflow environment? I was actually wondering about the same thing: having some kind of backup if one PgBouncer replica fails (during k8s node patching or whatever). The official chart also uses a hardcoded replicas: 1.

I've tried to understand how multiple PgBouncer replicas can affect the deployment (connections to the DB etc.), but I didn't find any links or tutorials explaining this multi-replica PgBouncer setup.

Would it also require customizing values such as maxClientConnections and poolSize? E.g. if you set replicas to 3, would you need to adjust these values accordingly (divide by 3?).
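
To make the sizing question concrete, here is a minimal arithmetic sketch. It assumes each PgBouncer replica keeps its own independent pools (separate PgBouncer processes do not coordinate with each other) and that the Service spreads clients roughly evenly across replicas. The function name and the example budgets are hypothetical, not chart defaults.

```python
import math


def per_replica_limits(total_pool_size: int, total_max_client_conn: int, replicas: int) -> dict:
    """Split cluster-wide connection budgets across PgBouncer replicas (illustrative only)."""
    if replicas < 1:
        raise ValueError("replicas must be >= 1")
    return {
        # Round down so replicas * poolSize never exceeds the database-side budget.
        "poolSize": total_pool_size // replicas,
        # Round up so the replicas together still accept the full client load.
        "maxClientConnections": math.ceil(total_max_client_conn / replicas),
    }


# Hypothetical budgets: 60 server connections and 1000 client connections in total.
limits = per_replica_limits(total_pool_size=60, total_max_client_conn=1000, replicas=3)
print(limits)
```

Under these assumptions, poolSize is the value that matters for the database, since the replicas together can open up to replicas × poolSize server connections. Dividing maxClientConnections is less clear-cut: if one replica goes away, the remaining ones have to absorb its clients, so leaving that value undivided may be the safer choice.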

Does anyone have experience with this?

jurovee avatar Jul 04 '22 13:07 jurovee

This issue has been automatically marked as stale because it has not had activity in 60 days. It will be closed in 7 days if no further activity occurs.

Thank you for your contributions.


Issues never become stale if any of the following is true:

  1. they are added to a Project
  2. they are added to a Milestone
  3. they have the lifecycle/frozen label

stale[bot] avatar Sep 05 '22 17:09 stale[bot]

@low-on-mana @juroVee I agree that having multiple PgBouncer replicas would be (in theory) great for redundancy, especially during node outages/upgrades. The problem is that any disruption to the database connection during a transaction will result in airflow raising an error, which I doubt airflow will gracefully recover from.

(NOTE: airflow uses SQLAlchemy in "pessimistic" pooling mode with the pre-ping approach, which can't handle mid-transaction failures)
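
To illustrate the note above, here is a toy model of the pre-ping idea in plain Python. It is not SQLAlchemy code: the classes are hypothetical stand-ins, and in real SQLAlchemy the corresponding switch is the `pool_pre_ping=True` argument to `create_engine`. The sketch only shows why a check at checkout time catches a connection that died while idle, but not one that dies mid-transaction.

```python
class ToyConnection:
    """Hypothetical stand-in for a database connection routed through PgBouncer."""

    def __init__(self):
        self.alive = True

    def ping(self) -> bool:
        return self.alive

    def execute(self, statement: str) -> str:
        if not self.alive:
            raise ConnectionError(f"connection lost while running: {statement}")
        return f"ok: {statement}"


class ToyPrePingPool:
    """Toy pool that tests a connection only at checkout time (the pre-ping idea)."""

    def __init__(self):
        self._conn = ToyConnection()

    def checkout(self) -> ToyConnection:
        if not self._conn.ping():
            # Stale connection detected before use: replace it transparently.
            self._conn = ToyConnection()
        return self._conn


pool = ToyPrePingPool()

# Case 1: the PgBouncer pod goes away while the connection sits idle in the pool.
pool.checkout().alive = False
conn = pool.checkout()  # the pre-ping notices the dead connection and reconnects
result_idle = conn.execute("SELECT 1")

# Case 2: the pod goes away after checkout, in the middle of a transaction.
conn = pool.checkout()
conn.execute("BEGIN")
conn.alive = False  # simulate the pod being rescheduled
try:
    conn.execute("UPDATE task_instance SET state = 'success'")
    result_mid = "ok"
except ConnectionError as exc:
    result_mid = f"failed: {exc}"

print(result_idle)
print(result_mid)
```

In this toy model the first case recovers silently, while the second surfaces as an error to the caller, which is the situation described for airflow tasks.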

That is to say, more PgBouncer replicas actually increase the possibility of airflow trying to use a connection to a PgBouncer Pod that is no longer active (and crashing as a result).
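
One possible way to read this claim is as simple probability. The sketch below assumes, purely for illustration, that each PgBouncer pod is disrupted independently with the same probability `p` during some time window; both the function name and the value of `p` are hypothetical.

```python
def p_any_replica_disrupted(p: float, replicas: int) -> float:
    """Probability that at least one of `replicas` pods is disrupted in the window."""
    return 1.0 - (1.0 - p) ** replicas


# Hypothetical per-pod disruption probability of 1% per window.
p = 0.01
chances = {n: p_any_replica_disrupted(p, n) for n in (1, 2, 3)}
for n, chance in chances.items():
    print(f"{n} replica(s): {chance:.4f}")
```

Under this model, the chance that some pod is disrupted grows with the replica count. On the other hand, each such event would only affect the connections routed through that one pod (roughly 1/N of them if load is spread evenly), so the model says nothing definitive about the overall number of failed tasks.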

We would need to investigate getting airflow to use a different SQLAlchemy pooling mode (to allow mid-transaction failures to be resolved gracefully) before we can increase PgBouncer replicas.

thesuperzapper avatar Sep 13 '22 01:09 thesuperzapper

@thesuperzapper Forgive me but why do you say higher "PgBouncer replicas actually increases the possibility of airflow trying to use a[n inactive] connection?"

I'm chasing HA on this particular component also, and want to understand the risk you're describing.

waldoppper avatar Sep 27 '22 13:09 waldoppper

This issue has been automatically marked as stale because it has not had activity in 60 days. It will be closed in 7 days if no further activity occurs.


stale[bot] avatar Nov 26 '22 15:11 stale[bot]