Allow more than one PgBouncer replica
Checks
- [X] I have checked for existing issues.
- [X] This report is about the User-Community Airflow Helm Chart.
Chart Version
latest
Kubernetes Version
NA
Helm Version
NA
Description
We are running the latest version of this chart in production with Airflow 2.3.0 (we migrated a few days ago).
One of the issues we faced is related to PgBouncer. Kubernetes rescheduled the PgBouncer pod to another node, and because only one pod was running, one task failed and we had to retry it manually later.
We could set safe-to-evict to false or add a pod disruption budget as a workaround, but the best solution would be to make PgBouncer highly available by running multiple pods.
Can we have 2 pods for HA PgBouncer?
spec:
  replicas: 1
  strategy:
    rollingUpdate:
      ## multiple pgbouncer pods can safely run concurrently
https://github.com/airflow-helm/charts/blob/420eae29c454f6e7e6a7837706ca2e6c0fe792b8/charts/airflow/templates/pgbouncer/pgbouncer-deployment.yaml#L24
Relevant Logs
No response
Custom Helm Values
No response
@low-on-mana is this really safe to use in an Airflow environment? I was wondering about the same thing: having some kind of backup if one PgBouncer replica fails (during k8s node patching, for example). The official chart also uses a hardcoded replicas: 1.
I've tried to understand how multiple PgBouncer replicas would affect the deployment (connections to the DB, etc.), but I couldn't find any links or tutorials explaining this multi-replica PgBouncer setup.
Would it also require customizing values such as maxClientConnections and poolSize? For example, if you set replicas to 3, would you need to adjust these values accordingly (divide them by 3)?
Does anyone have experience with this?
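To make the sizing question concrete, here is a minimal sketch of the arithmetic. It assumes each PgBouncer replica keeps its own independent server-side pool, so the worst-case number of connections reaching PostgreSQL grows with the replica count. The function names are hypothetical and nothing here is taken from the chart itself.

```python
def per_replica_pool_size(total_server_connections: int, replicas: int) -> int:
    """Split a database connection budget evenly across PgBouncer replicas.

    Assumption: every replica opens up to `pool_size` server connections on
    its own, so the budget has to be divided rather than reused as-is.
    Rounds down so the combined total never exceeds the budget.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if total_server_connections < replicas:
        raise ValueError("budget is too small to give each replica a connection")
    return total_server_connections // replicas


def worst_case_server_connections(pool_size: int, replicas: int) -> int:
    """Upper bound on server connections if every replica fills its pool."""
    return pool_size * replicas
```

With a budget of 20 server connections and 3 replicas, `per_replica_pool_size(20, 3)` gives 6 per replica, and `worst_case_server_connections(6, 3)` gives 18 in total. Under the same assumption the client-side limit would not need dividing, since each replica accepts its own clients, but that is worth confirming against the PgBouncer documentation.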
This issue has been automatically marked as stale because it has not had activity in 60 days. It will be closed in 7 days if no further activity occurs.
Thank you for your contributions.
Issues never become stale if any of the following is true:
- they are added to a Project
- they are added to a Milestone
- they have the lifecycle/frozen label
@low-on-mana @juroVee I agree that having multiple PgBouncer replicas would (in theory) be great for redundancy, especially during node outages/upgrades. The problem is that any disruption to the database connection during a transaction will result in airflow raising an error, which I doubt airflow will gracefully recover from.
(NOTE: airflow uses SQLAlchemy in "pessimistic" pooling mode with the pre-ping approach, which can't handle mid-transaction failures)
That is to say, more PgBouncer replicas actually increase the possibility of airflow trying to use a connection to a PgBouncer Pod that is no longer active (and crashing as a result).
We would need to investigate getting airflow to use a different SQLAlchemy pooling mode (one that allows mid-transaction failures to be resolved gracefully) before we can increase PgBouncer replicas.
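As a rough illustration of the pre-ping limitation described above, here is a toy model in plain Python. It is not SQLAlchemy or airflow code: `FakeConnection`, `PrePingPool`, and `run_transaction` are hypothetical stand-ins, and the only behavior modeled is "check liveness at checkout, nothing afterwards".

```python
class FakeConnection:
    """Stand-in for a database connection routed through one PgBouncer pod."""

    def __init__(self):
        self.alive = True

    def ping(self):
        return self.alive

    def execute(self, statement):
        if not self.alive:
            raise ConnectionError("server closed the connection unexpectedly")
        return f"ok: {statement}"


class PrePingPool:
    """Toy pool that tests a connection only at the moment it is checked out."""

    def __init__(self, size=2):
        self._idle = [FakeConnection() for _ in range(size)]

    def checkout(self):
        conn = self._idle.pop()
        if not conn.ping():
            # A connection that died while idle is replaced transparently.
            conn = FakeConnection()
        return conn

    def checkin(self, conn):
        self._idle.append(conn)


def run_transaction(pool, statements, kill_after=None):
    """Run statements on one connection; optionally kill it partway through."""
    conn = pool.checkout()
    results = []
    try:
        for index, statement in enumerate(statements):
            if kill_after is not None and index == kill_after:
                conn.alive = False  # simulate the pod going away mid-transaction
            results.append(conn.execute(statement))
    finally:
        pool.checkin(conn)
    return results
```

In this model, a connection that dies while idle is caught by the ping, so the next `run_transaction(pool, ["SELECT 1"])` succeeds. Calling `run_transaction(pool, ["BEGIN", "UPDATE"], kill_after=1)` raises `ConnectionError`, because the ping has already passed by the time the connection drops. Recovering from that case would need the caller to retry the whole transaction.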
@thesuperzapper Forgive me, but why do you say that more "PgBouncer replicas actually increases the possibility of airflow trying to use a[n inactive] connection"?
I'm chasing HA on this particular component also, and want to understand the risk you're describing.