
K8SPSMDB-1003: Kubernetes node zone/region tag

Open · sergelogvinov opened this issue 1 year ago · 13 comments


https://jira.percona.com/browse/K8SPSMDB-1003


Problem: We want to use read/write concerns based on the Kubernetes zone/region of the node a member runs on.

Cause: For example, reading from a single zone can reduce latency, while writing to multiple zones improves redundancy.

Solution: Simple changes. We read the node's topology labels (if we have the permission to do so) and add them as tags to the mongo node.
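For illustration, a minimal sketch of that lookup with client-go (the package, function, and variable names here are assumptions for this sketch, not the PR's actual code):

package topology

import (
    "context"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// nodeTopologyTags returns the well-known topology labels of a node,
// e.g. {"region": "us-east-1", "zone": "us-east-1a"}.
func nodeTopologyTags(ctx context.Context, c kubernetes.Interface, nodeName string) (map[string]string, error) {
    node, err := c.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
    if err != nil {
        return nil, err // Forbidden here when the operator lacks "nodes" RBAC
    }
    return map[string]string{
        "region": node.Labels["topology.kubernetes.io/region"],
        "zone":   node.Labels["topology.kubernetes.io/zone"],
    }, nil
}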

We also need to add an RBAC policy to the Helm chart:

{{- if or .Values.watchNamespace .Values.watchAllNamespaces }}
  - apiGroups:
    - ""
    resources:
    - nodes
    verbs:
    - get
    - list
    - watch
{{- end }}
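Once members carry such tags, a client can keep reads zone-local. A hedged sketch with the official MongoDB Go driver (the tag name, tag value, and connection URI are assumptions for illustration, not what the operator finally sets):

package main

import (
    "context"

    "go.mongodb.org/mongo-driver/mongo"
    "go.mongodb.org/mongo-driver/mongo/options"
    "go.mongodb.org/mongo-driver/mongo/readpref"
    "go.mongodb.org/mongo-driver/tag"
)

func main() {
    // Prefer secondaries tagged zone=us-east-1a; the empty tag set is a
    // fallback that matches any member if the zone has no secondary.
    rp := readpref.SecondaryPreferred(readpref.WithTagSets(
        tag.Set{{Name: "zone", Value: "us-east-1a"}},
        tag.Set{},
    ))

    client, err := mongo.Connect(context.Background(),
        options.Client().ApplyURI("mongodb://my-cluster-rs0.psmdb.svc.cluster.local/?replicaSet=rs0"))
    if err != nil {
        panic(err)
    }
    defer client.Disconnect(context.Background())

    // Reads through this collection handle now honor the zone tag.
    coll := client.Database("test").Collection("items",
        options.Collection().SetReadPreference(rp))
    _ = coll
}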

Thanks.

CHECKLIST

Jira

  • [x] Is the Jira ticket created and referenced properly?
  • [ ] Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • [ ] Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • [ ] Is an E2E test/test case added for the new feature/change?
  • [ ] Are unit tests added where appropriate?
  • [ ] Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • [ ] Are all needed new/changed options added to default YAML files?
  • [ ] Are the manifests (crd/bundle) regenerated if needed?
  • [ ] Did we add proper logging messages for operator actions?
  • [ ] Did we ensure compatibility with the previous version or cluster upgrade process?
  • [ ] Does the change support the oldest and newest supported MongoDB versions?
  • [ ] Does the change support the oldest and newest supported Kubernetes versions?

sergelogvinov · Oct 12 '23 08:10

CLA assistant check
All committers have signed the CLA.

it-percona-cla · Oct 18 '23 19:10

@sergelogvinov are you willing to work on this further? Looking at the test results, I don't think it works right now, but I think it's a useful feature. If you don't want to work on this further, we can take over.

egegunes · Jan 12 '24 09:01

@sergelogvinov ping

egegunes · Jan 19 '24 09:01

Hello, sorry for the delay.

I did some tests on my application side with these changes, and everything works as expected. But I think we need more changes here.

I know some clouds that do not allow you to use ClusterRole permissions (only single-namespace permissions), so this feature should be an option (a CRD option).

The proposed option: if topologyPrimaryKey exists (and is non-empty), we will add labels to the mongo nodes.

# Try to give higher priority to nodes whose zone = us-east-1a
topologyPrimaryPrefer: us-east-1a
# Can be kubernetes.io/hostname, topology.kubernetes.io/region, or topology.kubernetes.io/zone
topologyPrimaryKey: topology.kubernetes.io/zone
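To make this concrete, a hypothetical sketch of how the operator could turn these two fields into member priorities (the field names come from the proposal above; the function name and priority values are purely illustrative):

// memberPriority derives a replica set member priority from the
// proposed topologyPrimaryKey / topologyPrimaryPrefer fields.
func memberPriority(nodeLabels map[string]string, key, prefer string) float64 {
    if key == "" {
        return 1.0 // feature disabled: all members stay equal
    }
    if prefer != "" && nodeLabels[key] == prefer {
        return 2.0 // members in the preferred zone win primary elections
    }
    return 1.0
}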

And it can be done with https://jira.percona.com/browse/K8SPSMDB-1002

What do you think?

sergelogvinov · Jan 21 '24 14:01

@sergelogvinov yes, namespace permissions can be a problem since by default we don't use a ClusterRole. So unless the operator is deployed cluster-wide, this won't work. It'd be great if we could offer something for namespace-scoped deployments too. What do you think @hors @spron-in ?

@sergelogvinov I think K8SPSMDB-1002 should be implemented in another PR, wdyt?

egegunes · Jan 26 '24 09:01

@sergelogvinov yes, namespace permissions can be a problem since by default we don't use a ClusterRole. So unless the operator is deployed cluster-wide, this won't work. It'd be great if we could offer something for namespace-scoped deployments too. What do you think @hors @spron-in ?

@egegunes I think we can start with cluster-wide (CW) deployments and then we will see.

hors · Feb 01 '24 16:02

@sergelogvinov we'll start working on v1.16.0 this month, and if you want this included we can assist you.

egegunes · Feb 02 '24 09:02

@sergelogvinov ping

egegunes · Mar 01 '24 09:03

@egegunes Sorry for the delay.

I've rebased the PR and checked both cluster-wide and namespace-scoped deployments. It won't fail if it does not have ClusterRole permissions.
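For reference, the graceful degradation can be as simple as treating a Forbidden error as "feature off" (a sketch reusing the hypothetical nodeTopologyTags helper from the earlier sketch; the real PR may structure this differently):

// Skip topology tags silently when the operator lacks cluster-scoped
// permission to read nodes (namespace-only deployments).
labels, err := nodeTopologyTags(ctx, client, nodeName)
if err != nil {
    if apierrors.IsForbidden(err) { // apierrors = "k8s.io/apimachinery/pkg/api/errors"
        log.Info("no permission to read nodes, skipping topology tags")
        return nil
    }
    return err
}
_ = labels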

sergelogvinov · Mar 06 '24 12:03

I've checked the failed logs. Is it a CI issue?

Thanks.

sergelogvinov · Mar 08 '24 07:03

@sergelogvinov I think we have problems with backups and restores because of these changes. I don't think it's just a CI issue.

egegunes · Mar 08 '24 09:03

@sergelogvinov I think we have problems with backups and restores because of these changes. I don't think it's just a CI issue.

I've checked the logs, shell scripts, and other PRs. The latest PRs have the same error:

2024-03-08T14:38:20.000+0000 D [resync] bcp: 2024-03-08T14:37:40Z.pbm.json
2024-03-08T14:38:20.000+0000 W [resync] skip snapshot 2024-03-08T14:37:40Z: file "2024-03-08T14:37:40Z/shard1/oplog": no such file

I noticed that we run the operator in cluster-wide mode, so an operator in another namespace probably affects our e2e tests. Can you check the CI cluster, please?

Thanks.

sergelogvinov · Mar 08 '24 19:03

@nmarukovich could you please check this?

egegunes · Apr 04 '24 09:04

Test name Status
arbiter passed
balancer passed
custom-replset-name passed
cross-site-sharded passed
data-at-rest-encryption passed
data-sharded passed
demand-backup passed
demand-backup-eks-credentials passed
demand-backup-physical passed
demand-backup-physical-sharded passed
demand-backup-sharded passed
expose-sharded passed
ignore-labels-annotations passed
init-deploy passed
finalizer passed
ldap passed
ldap-tls passed
limits passed
liveness passed
mongod-major-upgrade passed
mongod-major-upgrade-sharded passed
monitoring-2-0 passed
multi-cluster-service passed
non-voting passed
one-pod passed
operator-self-healing-chaos passed
pitr passed
pitr-sharded passed
pitr-physical passed
pvc-resize passed
recover-no-primary passed
rs-shard-migration passed
scaling passed
scheduled-backup passed
security-context passed
self-healing-chaos passed
service-per-pod passed
serviceless-external-nodes passed
smart-update passed
split-horizon passed
storage passed
tls-issue-cert-manager passed
upgrade passed
upgrade-consistency passed
upgrade-consistency-sharded-tls passed
upgrade-sharded passed
users passed
version-service passed
We run 48 out of 48

commit: https://github.com/percona/percona-server-mongodb-operator/pull/1360/commits/95c1888e9e0ea97de58e9b3bcc3901141d4d652e
image: perconalab/percona-server-mongodb-operator:PR-1360-95c1888e

JNKPercona · Apr 24 '24 10:04

@sergelogvinov thank you for your contribution

hors · Apr 24 '24 10:04