percona-server-mongodb-operator
K8SPSMDB-1003: Kubernetes node zone/region tag
https://jira.percona.com/browse/K8SPSMDB-1003
Problem: we want to use read/write concerns based on the Kubernetes zone/region.
Cause: For example, reading from a single zone can reduce latency, while writing to multiple zones enhances redundancy.
Solution: Simple changes. We will read the node's properties (if we have permission to do so) and add them as tags to the nodes.
We also need to add an RBAC policy to the Helm chart:
```yaml
{{- if or .Values.watchNamespace .Values.watchAllNamespaces }}
- apiGroups:
  - ""
  resources:
  - nodes
  verbs:
  - get
  - list
  - watch
{{- end }}
```
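For illustration, these are the kind of well-known topology labels the operator would read from each node (the node name and label values below are made-up examples):

```yaml
# Illustrative only: typical well-known topology labels on a Kubernetes node.
apiVersion: v1
kind: Node
metadata:
  name: ip-10-0-1-23.ec2.internal   # made-up node name
  labels:
    kubernetes.io/hostname: ip-10-0-1-23.ec2.internal
    topology.kubernetes.io/region: us-east-1
    topology.kubernetes.io/zone: us-east-1a
```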
Thanks.
CHECKLIST
Jira
- [x] Is the Jira ticket created and referenced properly?
- [ ] Does the Jira ticket have the proper statuses for documentation (`Needs Doc`) and QA (`Needs QA`)?
- [ ] Does the Jira ticket link to the proper milestone (Fix Version field)?
Tests
- [ ] Is an E2E test/test case added for the new feature/change?
- [ ] Are unit tests added where appropriate?
- [ ] Are OpenShift compare files changed for E2E tests (`compare/*-oc.yml`)?
Config/Logging/Testability
- [ ] Are all needed new/changed options added to default YAML files?
- [ ] Are the manifests (crd/bundle) regenerated if needed?
- [ ] Did we add proper logging messages for operator actions?
- [ ] Did we ensure compatibility with the previous version or cluster upgrade process?
- [ ] Does the change support oldest and newest supported MongoDB version?
- [ ] Does the change support oldest and newest supported Kubernetes version?
@sergelogvinov are you willing to work on this further? Looking at the test results, I don't think it works right now, but I think it's a useful feature. If you don't want to work on this further, we can take over.
@sergelogvinov ping
Hello, sorry for the delay.
I did some tests on my application side with these changes, and everything works as expected. But I think we need more changes here.
I know some clouds that do not allow you to use ClusterRole permissions (only single-namespace permissions), so this feature should be optional (a CRD option).
The option proposal: if `topologyPrimaryKey` exists (and is non-empty), we will add labels to the mongo nodes.
```yaml
# Try to set higher priority for nodes whose zone = us-east-1a
topologyPrimaryPrefer: us-east-1a
# Can be kubernetes.io/hostname, topology.kubernetes.io/region, or topology.kubernetes.io/zone
topologyPrimaryKey: topology.kubernetes.io/zone
```
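As a rough sketch (the exact field placement in the custom resource is an assumption, not a final design), the options could look like this in cr.yaml:

```yaml
apiVersion: psmdb.percona.com/v1
kind: PerconaServerMongoDB
metadata:
  name: my-cluster-name
spec:
  replsets:
  - name: rs0
    size: 3
    # Hypothetical placement of the proposed options:
    topologyPrimaryKey: topology.kubernetes.io/zone  # node label to mirror into member tags
    topologyPrimaryPrefer: us-east-1a                # prefer this zone when electing a primary
```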
This could be done together with https://jira.percona.com/browse/K8SPSMDB-1002.
What do you think?
@sergelogvinov yes, namespace permission can be a problem since by default we don't use ClusterRole. So unless the operator is deployed cluster-wide, this won't work. It'd be great if we could offer something for namespace-scoped deployments too, what do you think @hors @spron-in?
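For reference, Node objects are cluster-scoped, so a namespaced Role cannot grant access to them at all; reading node labels needs something like the following ClusterRole (the name here is illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: psmdb-operator-node-reader  # illustrative name
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]
```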
@sergelogvinov I think K8SPSMDB-1002 should be implemented in another PR, wdyt?
@egegunes I think we can start with cluster-wide (CW) and then we will see.
@sergelogvinov we'll start working on v1.16.0 this month, and if you want to get this in, we can assist you.
@sergelogvinov ping
@egegunes Sorry for the delay.
I've rebased the PR and checked both the cluster-wide and namespaced deployments. It won't fail if it does not have ClusterRole permissions.
I've checked the failed logs. Is it a CI issue?
Thanks.
@sergelogvinov I think we have problems with backups and restores because of these changes. I don't think it's just a CI issue.
I've checked the logs/shell scripts and other PRs. The latest PRs have the same error:
```
2024-03-08T14:38:20.000+0000 D [resync] bcp: 2024-03-08T14:37:40Z.pbm.json
2024-03-08T14:38:20.000+0000 W [resync] skip snapshot 2024-03-08T14:37:40Z: file "2024-03-08T14:37:40Z/shard1/oplog": no such file
```
I noticed that we run the operator in cluster-wide mode, so an operator in another namespace probably affects our e2e tests. Can you check the CI cluster, please?
Thanks.
@nmarukovich could you please check this?
Test name | Status |
---|---|
arbiter | passed |
balancer | passed |
custom-replset-name | passed |
cross-site-sharded | passed |
data-at-rest-encryption | passed |
data-sharded | passed |
demand-backup | passed |
demand-backup-eks-credentials | passed |
demand-backup-physical | passed |
demand-backup-physical-sharded | passed |
demand-backup-sharded | passed |
expose-sharded | passed |
ignore-labels-annotations | passed |
init-deploy | passed |
finalizer | passed |
ldap | passed |
ldap-tls | passed |
limits | passed |
liveness | passed |
mongod-major-upgrade | passed |
mongod-major-upgrade-sharded | passed |
monitoring-2-0 | passed |
multi-cluster-service | passed |
non-voting | passed |
one-pod | passed |
operator-self-healing-chaos | passed |
pitr | passed |
pitr-sharded | passed |
pitr-physical | passed |
pvc-resize | passed |
recover-no-primary | passed |
rs-shard-migration | passed |
scaling | passed |
scheduled-backup | passed |
security-context | passed |
self-healing-chaos | passed |
service-per-pod | passed |
serviceless-external-nodes | passed |
smart-update | passed |
split-horizon | passed |
storage | passed |
tls-issue-cert-manager | passed |
upgrade | passed |
upgrade-consistency | passed |
upgrade-consistency-sharded-tls | passed |
upgrade-sharded | passed |
users | passed |
version-service | passed |
We ran 48 out of 48.
commit: https://github.com/percona/percona-server-mongodb-operator/pull/1360/commits/95c1888e9e0ea97de58e9b3bcc3901141d4d652e
image: perconalab/percona-server-mongodb-operator:PR-1360-95c1888e
@sergelogvinov thank you for your contribution