
AtlasDatabaseUser - message - unable to list: test because of unknown namespace for the cache

Open qtranton opened this issue 1 year ago • 10 comments

We run operator version 1.7.1 and decided to upgrade to the latest version in our cluster. To test the upgrade, I created a local environment:

  1. Kubernetes v1.25.4 via Docker Desktop
  2. Operator v1.7.1
  3. Add an AtlasDeployment and an AtlasDatabaseUser
  4. Upgrade to v2.2.0 (helm upgrade of the CRDs, then upgrade of the operator)
  5. Fix the AtlasDeployment
  6. Check the operator logs and see an error like this:

{"level":"INFO","time":"2024-04-16T12:12:14.543Z","msg":"Status update","atlasdatabaseuser":"test/operator-upgrade-test","lastCondition":{"type":"DatabaseUserReady","status":"False","lastTransitionTime":null,"reason":"DatabaseUserStaleConnectionSecrets","message":"unable to list: test because of unknown namespace for the cache"}}

What did you expect? After all of these steps, the operator should simply work as expected.

What happened instead? The AtlasDatabaseUser status is always False.

Operator Information

  • 1.7.1 -> 2.2.0

Kubernetes Cluster Information

  • Docker Desktop
  • 1.25.4

Additional context: I am trying to figure out why the AtlasDatabaseUser resource fails. The operator creates the proper secrets and creates the users in the Atlas UI, but the resource itself always stays in the Ready: False state.

status:
    conditions:
    - lastTransitionTime: "2024-04-16T12:03:17Z"
      status: "False"
      type: Ready
    - lastTransitionTime: "2024-04-16T11:44:08Z"
      status: "True"
      type: ResourceVersionIsValid
    - lastTransitionTime: "2024-04-16T11:44:08Z"
      status: "True"
      type: ValidationSucceeded
    - lastTransitionTime: "2024-04-16T12:03:18Z"
      message: 'unable to list: test because of unknown namespace for the cache'
      reason: DatabaseUserStaleConnectionSecrets
      status: "False"
      type: DatabaseUserReady

If possible, please include:

{"level":"DEBUG","time":"2024-04-16T12:17:12.709Z","msg":"Ensured connection Secret up-to-date","atlasdatabaseuser":"test/operator-upgrade-test","secretname":"HIDDEN"}
{"level":"INFO","time":"2024-04-16T12:17:12.709Z","msg":"Status update","atlasdatabaseuser":"test/operator-upgrade-test-","lastCondition":{"type":"DatabaseUserReady","status":"False","lastTransitionTime":null,"reason":"DatabaseUserStaleConnectionSecrets","message":"unable to list: test because of unknown namespace for the cache"}}

qtranton avatar Apr 16 '24 12:04 qtranton

Thanks for reporting this issue @qtranton !

Could you give us a minimum YAML sample we could use to reproduce the issue? Does not need to be your original complete setup, just the definitions that reproduce the same failure.

josvazg avatar Apr 17 '24 13:04 josvazg

Sure, I have cleaned up my YAML; here it is:

apiVersion: v1
kind: Secret
metadata:
  labels:
    app: operator-upgrade
    atlas.mongodb.com/type: credentials
    env: dev
  name: operator-upgrade-test
  namespace: test
stringData:
  password: testpassword


---
# Source: app-resources/templates/mongodb_atlas.yaml
apiVersion: atlas.mongodb.com/v1
kind: AtlasBackupPolicy
metadata:
  name: operator-upgrade-test
  namespace: test
  annotations:
    mongodb.com/atlas-resource-policy: "keep"
spec:
  items: 
    - frequencyInterval: 12
      frequencyType: hourly
      retentionUnit: days
      retentionValue: 1
    - frequencyInterval: 1
      frequencyType: daily
      retentionUnit: days
      retentionValue: 7
    - frequencyInterval: 6
      frequencyType: weekly
      retentionUnit: weeks
      retentionValue: 1
    - frequencyInterval: 40
      frequencyType: monthly
      retentionUnit: months
      retentionValue: 1
---
# Source: app-resources/templates/mongodb_atlas.yaml
apiVersion: atlas.mongodb.com/v1
kind: AtlasBackupSchedule
metadata:
  name: operator-upgrade-test
  namespace: test
  annotations:
    mongodb.com/atlas-resource-policy: "keep"
spec:
  autoExportEnabled: false
  referenceHourOfDay: 21
  referenceMinuteOfHour: 2
  policy:
    name: operator-upgrade-test
    namespace: test
---
# Source: app-resources/templates/mongodb_atlas.yaml
apiVersion: atlas.mongodb.com/v1
kind: AtlasDatabaseUser
metadata:
  name: operator-upgrade-test
  labels:
    app: "operator-upgrade"
    env: dev
  #   mongodb.com/atlas-resource-policy: "keep"
spec:
  roles:
  - roleName: readWrite
    databaseName: Application
  scopes:
  - type: CLUSTER
    name: operator-upgrade-test
  projectRef:
    name: project-name
    namespace: mongodb-operator
  username: operator-upgrade-test
  databaseName: admin
  passwordSecretRef:
    name: "operator-upgrade-test"

---
# Source: app-resources/templates/mongodb_atlas.yaml
apiVersion: atlas.mongodb.com/v1
kind: AtlasDeployment
metadata:
  name: operator-upgrade-test
  namespace: test
  labels:
    app: "operator-upgrade"
    env: dev
  # annotations:
  #   mongodb.com/atlas-resource-policy: "keep"
spec:
  backupRef:
    name: operator-upgrade-test
    namespace: test
  projectRef:
    name: project-name
    namespace: mongodb-operator
  advancedDeploymentSpec:
    mongoDBMajorVersion: "6.0"
    clusterType: REPLICASET
    backupEnabled: true
    pitEnabled: false
    name: operator-upgrade-test
    replicationSpecs:
      - regionConfigs:
        - electableSpecs:
              instanceSize: M10
              nodeCount: 3
          providerName: GCP
          backingProviderName: GCP
          regionName: "EASTERN_US"
          # Priority description https://www.mongodb.com/docs/atlas/reference/atlas-operator/atlasdeployment-custom-resource/#mongodb-setting-spec.advancedDeploymentSpec.replicationSpecs.regionConfigs.priority
          priority: 7
          autoScaling:
            compute:
              enabled: false

qtranton avatar Apr 18 '24 09:04 qtranton

cc @roothorp

s-urbaniak avatar Apr 19 '24 07:04 s-urbaniak

@qtranton can you check if you happen to have the WATCH_NAMESPACE environment variable set for your operator deployment? i.e. if you could submit the output of kubectl -n <operator_namespace> get pod <operator_name> here?

s-urbaniak avatar Apr 26 '24 11:04 s-urbaniak

i.e. it looks like the test namespace is not being watched by the operator because it is overridden by the WATCH_NAMESPACE env variable.

s-urbaniak avatar Apr 26 '24 11:04 s-urbaniak
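
For readers following the hypothesis above, here is a minimal Go sketch of how a comma-separated WATCH_NAMESPACE value can restrict which namespaces a cache will serve. It is a simplified model, not the operator's or controller-runtime's actual code: `parseWatchNamespaces` and `listFromCache` are hypothetical names, and treating an empty value as "watch everything" is an assumption about the intended behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// parseWatchNamespaces splits a comma-separated WATCH_NAMESPACE value into a set.
// An empty value yields an empty set, which this sketch treats as "watch all namespaces".
func parseWatchNamespaces(value string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, ns := range strings.Split(value, ",") {
		if ns = strings.TrimSpace(ns); ns != "" {
			set[ns] = struct{}{}
		}
	}
	return set
}

// listFromCache is a hypothetical stand-in for a namespaced list call against a
// cache that only knows about the watched namespaces.
func listFromCache(watched map[string]struct{}, namespace string) error {
	if len(watched) == 0 {
		return nil // assumption: no restriction means every namespace is served
	}
	if _, ok := watched[namespace]; !ok {
		return fmt.Errorf("unable to list: %v because of unknown namespace for the cache", namespace)
	}
	return nil
}

func main() {
	// Only the operator namespace is watched, so listing in "test" fails.
	fmt.Println(listFromCache(parseWatchNamespaces("mongodb-operator"), "test"))
	// Adding "test" to the watched list makes the same call succeed.
	fmt.Println(listFromCache(parseWatchNamespaces("mongodb-operator,test"), "test"))
}
```

In this model, a resource in a namespace missing from the watched set produces the same message as the one reported in the status condition.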

In the Helm chart I see this:

{{- if .Values.watchNamespaces }}
          - name: WATCH_NAMESPACE
            value: "{{ join "," .Values.watchNamespaces }}"
          {{- end }}

So I checked the pod, and the variable is not set there:

 Readiness:  http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
      OPERATOR_POD_NAME:   mongodb-atlas-operator-5df9ff6978-tqznx (v1:metadata.name)
      OPERATOR_NAMESPACE:  mongodb-operator (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4kgq7 (to)

qtranton avatar Apr 29 '24 08:04 qtranton

In the roles I also see some mention of this variable, but since it is empty, no additional roles were created:

mongodb-operator   mongodb-atlas-operator                           
mongodb-operator   mongodb-atlas-operator-leader-election-role      

Also, it works on the older version, so I guess the older version could read the secrets.

qtranton avatar Apr 29 '24 08:04 qtranton

I validated the secrets as well. When I remove the label

atlas.mongodb.com/type: credentials

I get an error like this:

"msg":"Status update","atlasdatabaseuser":"tester/operator-upgrade-test","lastCondition":{"type":"DatabaseUserReady","status":"False","lastTransitionTime":null,"reason":"InternalError","message":"Secret \"operator-upgrade-test\" not found"}}

With the label back in place, I get this error:

"msg":"Status update","atlasdatabaseuser":"tester/operator-upgrade-test","lastCondition":{"type":"DatabaseUserReady","status":"False","lastTransitionTime":null,"reason":"DatabaseUserStaleConnectionSecrets","message":"unable to list: tester because of unknown namespace for the cache"}}

qtranton avatar Apr 30 '24 12:04 qtranton

@josvazg @s-urbaniak hey, I had some time to debug the issue. On my local cluster, for some reason, on version 2.2.2 I do not see the status.name parameter. I just put a lot of println calls in a local branch :D

    #############################
    operator-upgrade-test
    cleanupStaleSecrets: Failed to list connection Secrets 
    ############################# 

That output comes from print statements I added to this block:

if user.Status.UserName != user.Spec.Username {
	// Note, that we pass the username from the status, not from the spec
	fmt.Println("#############################")
	fmt.Println(user.Status.UserName, user.Spec.Username)
	fmt.Println("cleanupStaleSecrets: Failed to list connection Secrets")
	fmt.Println("#############################")
	return RemoveStaleSecretsByUserName(ctx.Context, k8sClient, projectID, user.Status.UserName, user, ctx.Log)
}

The block is here: https://github.com/mongodb/mongodb-atlas-kubernetes/blob/main/pkg/controller/connectionsecret/connectionsecrets.go#L126. Now I am trying to figure out why I get an error related to secrets when the user name is not set in the status. Meanwhile, the resource status looks like this:

status:
    conditions:
    - lastTransitionTime: "2024-05-27T11:30:38Z"
      status: "False"
      type: Ready
    - lastTransitionTime: "2024-05-27T11:30:38Z"
      status: "True"
      type: ResourceVersionIsValid
    - lastTransitionTime: "2024-05-27T11:30:38Z"
      status: "True"
      type: ValidationSucceeded
    - lastTransitionTime: "2024-05-27T11:30:39Z"
      message: 'unable to list: tester because of unknown namespace for the cache'
      reason: DatabaseUserStaleConnectionSecrets
      status: "False"
      type: DatabaseUserReady
    observedGeneration: 1
    passwordVersion: "3017702"

qtranton avatar May 27 '24 13:05 qtranton

Update: I rechecked on v1.7, and there the name does appear in the status.

qtranton avatar May 27 '24 13:05 qtranton
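
The comparison in the snippet above can be modeled in isolation. This Go sketch uses a hypothetical, stripped-down `databaseUser` type (not the operator's real types) to show that an empty name in the status always differs from a non-empty spec username, so the stale-secret cleanup path is taken on every reconcile.

```go
package main

import "fmt"

// databaseUser is a hypothetical stand-in for the AtlasDatabaseUser resource,
// keeping only the two fields used in the comparison.
type databaseUser struct {
	SpecUsername   string // spec.username
	StatusUserName string // name recorded in the status; empty when it is missing
}

// needsStaleSecretCleanup mirrors the condition from the snippet above:
// cleanup runs whenever the status name and the spec username differ.
func needsStaleSecretCleanup(u databaseUser) bool {
	return u.StatusUserName != u.SpecUsername
}

func main() {
	// Observed on 2.2.2: the status carries no name, so the names differ.
	afterUpgrade := databaseUser{SpecUsername: "operator-upgrade-test"}
	// Observed on 1.7: the status name is present and matches the spec.
	inSync := databaseUser{SpecUsername: "operator-upgrade-test", StatusUserName: "operator-upgrade-test"}

	fmt.Println(needsStaleSecretCleanup(afterUpgrade))
	fmt.Println(needsStaleSecretCleanup(inSync))
}
```

Under this model, the cleanup branch, and with it the secret listing that fails, is reached only while the status name is missing.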

Thanks for your reports. I managed to reproduce the same. I am debugging it now.

josvazg avatar May 27 '24 17:05 josvazg

It seems we found the issue, and we are working on a fix.

In the meantime, you could pass the list of namespaces you want the operator to watch, i.e.:

helm install ... --set watchNamespaces=test,...

josvazg avatar May 27 '24 18:05 josvazg

@josvazg on my local machine, yes, but on the main cluster we have too many namespaces :) I will wait; it is not that critical.

qtranton avatar May 28 '24 09:05 qtranton

BTW, #1619 already fixes the issue, but it includes unrelated refactors. I am working on a specific test to cover this bug, which was not previously detected by our test suite.

josvazg avatar May 29 '24 09:05 josvazg

I will build and check it locally then :)

qtranton avatar May 30 '24 13:05 qtranton

@josvazg jfyi

{"level":"ERROR","time":"2024-05-30T14:00:05.322Z","msg":"LeaderElectionID must be configuredunable to start operator"}

I get this error now.

qtranton avatar May 30 '24 14:05 qtranton

I do not think this is related. BTW this PR #1621 should fix the original issue.

As for this new error, do you have a sample to reproduce it?

josvazg avatar May 31 '24 14:05 josvazg

@josvazg I just built the image and put the Docker container into Helm chart 2.2.2; nothing else changed in the deployment.

qtranton avatar May 31 '24 15:05 qtranton

@josvazg After adding a few additional CRDs (not in upstream yet :D), the user status becomes True. We will run some additional tests against our infrastructure. Do you know when it will be released?

qtranton avatar Jun 03 '24 14:06 qtranton

We are aiming for a release soon, maybe this week. I should be merging PR #1621 tomorrow

josvazg avatar Jun 03 '24 16:06 josvazg