
Resources deleted when HNC restarts despite existing previously

Alan01252 opened this issue 3 years ago • 9 comments

When HNC restarts, the propagated objects can be deleted unnecessarily if the parent namespace hasn't yet fully synced. This is especially problematic when propagating network rules, as it can lead to loss of service whilst the network rules are re-created.

Digging through the code, the problem appears to be that the ActivitiesHalted flag is removed too early: it's removed when the parent namespace is added to the forest, not when all the objects inside the parent namespace have been synced.
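
Here's a toy model of what I think is happening (the types and names below are mine for illustration, not the real HNC code):

```go
package main

import "fmt"

// Toy model of the ordering described above; illustrative names only.
type namespace struct {
	name          string
	inForest      bool // the namespace itself has been seen
	objectsSynced bool // its source objects have been listed from the apiserver
	halted        bool // ActivitiesHalted condition
}

func reconcile(parent, child *namespace) {
	// Observed behaviour: only the parent's *presence* is checked, so the
	// halt is lifted before the parent's objects are in the cache.
	if parent.inForest {
		child.halted = false
	}
	if !child.halted && !parent.objectsSynced {
		// Object reconcilers now run, find no source object, and delete
		// the propagated copies.
		fmt.Printf("propagated objects in %s deleted unnecessarily\n", child.name)
	}
}

func main() {
	parent := &namespace{name: "acp-shrd", inForest: true} // objects not yet synced
	child := &namespace{name: "acp-shrd-srv-service-auth", halted: true}
	reconcile(parent, child)
}
```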

I've been trying to think of ways to fix this in the code but have come up short.

Hoping someone here can please advise!

version: 1.0.0

Alan01252 avatar Sep 23 '22 10:09 Alan01252

This is an example excerpt from the logs:

{"level":"info","ts":1663928171.3328152,"logger":"hierarchyconfig.reconcile","msg":"Setting ConditionActivitiesHalted: parent doesn't exist (or hasn't been synced yet)","rid":4,"ns":"acp-shrd-srv-service-wsipprovisioningmediator","parent":"acp-shrd"} {"level":"info","ts":1663928171.333898,"logger":"hierarchyconfig.reconcile","msg":"Setting ConditionActivitiesHalted: parent doesn't exist (or hasn't been synced yet)","rid":8,"ns":"acp-shrd-srv-service-helloworld","parent":"acp-shrd"} {"level":"info","ts":1663928171.3342464,"logger":"hierarchyconfig.reconcile","msg":"Setting ConditionActivitiesHalted: parent doesn't exist (or hasn't been synced yet)","rid":9,"ns":"acp-shrd-srv-service-signalrsubscription","parent":"acp-shrd"} {"level":"info","ts":1663928171.33828,"logger":"hierarchyconfig.reconcile","msg":"Setting ConditionActivitiesHalted: parent doesn't exist (or hasn't been synced yet)","rid":11,"ns":"acp-shrd-srv-service-auth","parent":"acp-shrd"} {"level":"info","ts":1663928171.3384438,"logger":"hierarchyconfig.reconcile","msg":"New namespace found","rid":11,"ns":"acp-shrd-srv-service-auth"} {"level":"info","ts":1663928171.3384647,"logger":"hierarchyconfig.reconcile","msg":"Namespace's managed and/or tree labels have been updated","rid":11,"ns":"acp-shrd-srv-service-auth"} {"level":"info","ts":1663928171.3384745,"logger":"hierarchyconfig.reconcile","msg":"Setting ActivitiesHalted on namespace","rid":11,"ns":"acp-shrd-srv-service-auth","conditions":[{"type":"ActivitiesHalted","status":"True","lastTransitionTime":"1970-01-01T00:00:00Z","reason":"ParentMissing","message":"Parent \"acp-shrd\" does not exist"}]} {"level":"info","ts":1663928173.433423,"logger":"RoleBinding.reconcile","msg":"Namespace has 'ActivitiesHalted' condition; will not touch propagated object","rid":37,"trigger":"acp-shrd-srv-service-auth/EKS-Shared-Admin","haltedRoot":"acp-shrd-srv-service-auth","suppressedAction":"remove"} {"level":"info","ts":1663928173.8340063,"logger":"Provider.reconcile","msg":"Namespace has 'ActivitiesHalted' condition; will not touch propagated object","rid":87,"trigger":"acp-shrd-srv-service-auth/flux-system-alertmanager-0","haltedRoot":"acp-shrd-srv-service-auth","suppressedAction":"remove"} {"level":"info","ts":1663928173.8358054,"logger":"RoleBinding.reconcile","msg":"Namespace has 'ActivitiesHalted' condition; will not touch propagated object","rid":98,"trigger":"acp-shrd-srv-service-auth/EKS-Shared-Read","haltedRoot":"acp-shrd-srv-service-auth","suppressedAction":"remove"} {"level":"info","ts":1663928173.9322064,"logger":"Provider.reconcile","msg":"Namespace has 'ActivitiesHalted' condition; will not touch propagated object","rid":108,"trigger":"acp-shrd-srv-service-auth/flux-system-alertmanager-0","haltedRoot":"acp-shrd-srv-service-auth","suppressedAction":"remove"} {"level":"info","ts":1663928175.136427,"logger":"RoleBinding.reconcile","msg":"Namespace has 'ActivitiesHalted' condition; will not touch propagated object","rid":156,"trigger":"acp-shrd-srv-service-auth/EKS-Shared-Read","haltedRoot":"acp-shrd-srv-service-auth","suppressedAction":"remove"} {"level":"info","ts":1663928175.1365788,"logger":"RoleBinding.reconcile","msg":"Namespace has 'ActivitiesHalted' condition; will not touch propagated object","rid":152,"trigger":"acp-shrd-srv-service-auth/EKS-Shared-Admin","haltedRoot":"acp-shrd-srv-service-auth","suppressedAction":"remove"} {"level":"info","ts":1663928175.7343497,"logger":"hierarchyconfig.reconcile","msg":"New namespace found","rid":24,"ns":"acp-shrd"} 
{"level":"info","ts":1663928175.7343872,"logger":"hierarchyconfig.reconcile","msg":"Namespace's managed and/or tree labels have been updated","rid":24,"ns":"acp-shrd"} {"level":"info","ts":1663928177.1337447,"logger":"Secret.reconcile","msg":"Namespace has 'ActivitiesHalted' condition; will not touch propagated object","rid":199,"trigger":"acp-shrd-srv-service-auth/squid-proxy-connection-string-25kfh2dm75","haltedRoot":"acp-shrd-srv-service-auth","suppressedAction":"no-op"} {"level":"info","ts":1663928177.1342094,"logger":"Secret.reconcile","msg":"Namespace has 'ActivitiesHalted' condition; will not touch propagated object","rid":197,"trigger":"acp-shrd-srv-service-auth/default-token-rrpl5","haltedRoot":"acp-shrd-srv-service-auth","suppressedAction":"no-op"} {"level":"info","ts":1663928177.1343021,"logger":"Secret.reconcile","msg":"Namespace has 'ActivitiesHalted' condition; will not touch propagated object","rid":196,"trigger":"acp-shrd-srv-service-auth/acp-shrd-srv-service-auth-configuration-kh5fdthk8t","haltedRoot":"acp-shrd-srv-service-auth","suppressedAction":"no-op"} {"level":"info","ts":1663928177.2338564,"logger":"Secret.reconcile","msg":"Namespace has 'ActivitiesHalted' condition; will not touch propagated object","rid":198,"trigger":"acp-shrd-srv-service-auth/sh.helm.release.v1.acp-shrd-srv-service-auth.v1","haltedRoot":"acp-shrd-srv-service-auth","suppressedAction":"no-op"} {"level":"info","ts":1663928179.333814,"logger":"Provider.reconcile","msg":"Namespace has 'ActivitiesHalted' condition; will not touch propagated object","rid":243,"trigger":"acp-shrd-srv-service-auth/flux-system-alertmanager-0","haltedRoot":"acp-shrd-srv-service-auth","suppressedAction":"no-op"} {"level":"info","ts":1663928179.738065,"logger":"RoleBinding.reconcile","msg":"Namespace has 'ActivitiesHalted' condition; will not touch propagated object","rid":259,"trigger":"acp-shrd-srv-service-auth/EKS-Shared-Admin","haltedRoot":"acp-shrd-srv-service-auth","suppressedAction":"no-op"} {"level":"info","ts":1663928179.8322206,"logger":"RoleBinding.reconcile","msg":"Namespace has 'ActivitiesHalted' condition; will not touch propagated object","rid":265,"trigger":"acp-shrd-srv-service-auth/EKS-Shared-Read","haltedRoot":"acp-shrd-srv-service-auth","suppressedAction":"no-op"} {"level":"info","ts":1663928180.4438827,"logger":"Secret.reconcile","msg":"Namespace has 'ActivitiesHalted' condition; will not touch propagated object","rid":297,"trigger":"acp-shrd-srv-service-auth/default-token-k97k6","haltedRoot":"acp-shrd-srv-service-auth","suppressedAction":"remove"} {"level":"info","ts":1663928188.2342489,"logger":"NetworkPolicy.reconcile","msg":"Namespace has 'ActivitiesHalted' condition; will not touch propagated object","rid":522,"trigger":"acp-shrd-srv-service-auth/ingress-shrd","haltedRoot":"acp-shrd-srv-service-auth","suppressedAction":"remove"} {"level":"info","ts":1663928188.632802,"logger":"hierarchyconfig.reconcile","msg":"Namespace's managed and/or tree labels have been updated","rid":63,"ns":"acp-shrd-srv-service-auth"} {"level":"info","ts":1663928188.6328425,"logger":"hierarchyconfig.reconcile","msg":"ActivitiesHalted condition removed","rid":63,"ns":"acp-shrd-srv-service-auth"} {"level":"info","ts":1663928190.2332792,"logger":"Alert.reconcile","msg":"Deleted propagated object","rid":701,"trigger":"acp-shrd-srv-service-auth/shrd-kustomization-alert-config-alertmanager-0"} {"level":"info","ts":1663928196.034376,"logger":"Alert.reconcile","msg":"Deleted propagated 
object","rid":1123,"trigger":"acp-shrd-srv-service-auth/shrd-kustomization-alert-config-alertmanager-0"} {"level":"info","ts":1663928199.7321632,"logger":"Secret.reconcile","msg":"Deleted propagated object","rid":1565,"trigger":"acp-shrd-srv-service-auth/default-token-k97k6"} {"level":"info","ts":1663928200.938197,"logger":"Secret.reconcile","msg":"Deleted propagated object","rid":1620,"trigger":"acp-shrd-srv-service-auth/default-token-k97k6"} {"level":"info","ts":1663928231.9412584,"logger":"Provider.reconcile","msg":"Propagating object","rid":1710,"trigger":"acp-shrd-srv-service-auth/flux-system-alertmanager-0"} {"level":"info","ts":1663928244.1189086,"logger":"Alert.reconcile","msg":"Propagating object","rid":1865,"trigger":"acp-shrd-srv-service-auth/shrd-helmrelease-alert-config-alertmanager-0"} {"level":"info","ts":1663928244.1201637,"logger":"NetworkPolicy.reconcile","msg":"Propagating object","rid":1866,"trigger":"acp-shrd-srv-service-auth/ingress-shrd"} {"level":"info","ts":1663928976.3114276,"logger":"NetworkPolicy.reconcile","msg":"Updating propagated object","rid":1992,"trigger":"acp-shrd-srv-service-auth/egress-shrd"} You can see that the namespace originally has ActiviedHalted correctly, however at some point the ActiviesHalted condition gets removed, and the propagation of objects is allowed.

However, because the parent's objects haven't fully synced, the object is incorrectly deleted. You can see later that the object is propagated again once the parent has fully synced.

Alan01252 avatar Sep 23 '22 11:09 Alan01252

I added some more logging and ran again, just to make this easier to see and to hopefully prove I'm not going mad.

{"level":"info","ts":1663932832.942569,"logger":"hierarchyconfig.reconcile","msg":"Setting ConditionActivitiesHalted: parent doesn't exist","rid":40,"ns":"acp-shrd-srv-service-auth","parent":"acp-shrd"} {"level":"info","ts":1663932832.9425945,"logger":"hierarchyconfig.reconcile","msg":"Enqueuing","rid":40,"ns":"acp-shrd-srv-service-auth","reason":"removed as parent","affected":["<none>"]} {"level":"info","ts":1663932832.9426525,"logger":"hierarchyconfig.reconcile","msg":"Setting ActivitiesHalted on namespace","rid":40,"ns":"acp-shrd-srv-service-auth","conditions":[{"type":"ActivitiesHalted","status":"True","lastTransitionTime":"2022-09-23T11:33:52Z","reason":"ParentMissing","message":"Parent \"acp-shrd\" does not exist"}]} {"level":"info","ts":1663932832.9426742,"logger":"hierarchyconfig.reconcile","msg":"Enqueuing","rid":40,"ns":"acp-shrd-srv-service-auth","reason":"descendant of a namespace with ActivitiesHalted added","affected":[]} {"level":"info","ts":1663932834.5397835,"logger":"Role.reconcile","msg":"Deleted propagated object","rid":11,"trigger":"acp-shrd-srv-service-staticcontent/EKS-Shared-Read"} {"level":"info","ts":1663932835.7471323,"logger":"Alert.reconcile","msg":"Namespace has 'ActivitiesHalted' condition; will not touch propagated object","rid":40,"trigger":"acp-shrd-srv-service-auth/shrd-helmrelease-alert-config-alertmanager-0","haltedRoot":"acp-shrd-srv-service-auth","suppressedAction":"no-op"} {"level":"info","ts":1663932836.4446335,"logger":"Role.reconcile","msg":"Namespace has 'ActivitiesHalted' condition; will not touch propagated object","rid":97,"trigger":"acp-shrd-srv-service-auth/EKS-Shared-Read","haltedRoot":"acp-shrd-srv-service-auth","suppressedAction":"remove"} {"level":"info","ts":1663932836.5441487,"logger":"NetworkPolicy.reconcile","msg":"Namespace has 'ActivitiesHalted' condition; will not touch propagated object","rid":117,"trigger":"acp-shrd-srv-service-auth/ingress-shrd","haltedRoot":"acp-shrd-srv-service-auth","suppressedAction":"remove"} {"level":"info","ts":1663932836.5450675,"logger":"hierarchyconfig.reconcile","msg":"ActivitiesHalted condition removed","rid":75,"ns":"acp-shrd-srv-service-auth"} {"level":"info","ts":1663932836.5450726,"logger":"hierarchyconfig.reconcile","msg":"Enqueuing","rid":75,"ns":"acp-shrd-srv-service-auth","reason":"descendant of a namespace with ActivitiesHalted removed","affected":[]} {"level":"info","ts":1663932836.738606,"logger":"Alert.reconcile","msg":"Deleted propagated object","rid":127,"trigger":"acp-shrd-srv-service-auth/shrd-kustomization-alert-config-alertmanager-0"} {"level":"info","ts":1663932836.741233,"logger":"Role.reconcile","msg":"Namespace has 'ActivitiesHalted' condition; will not touch propagated object","rid":150,"trigger":"acp-shrd-srv-service-appconfiguration/EKS-Shared-Read","haltedRoot":"acp-shrd-srv-service-appconfiguration","suppressedAction":"remove"} {"level":"info","ts":1663932836.8385975,"logger":"Role.reconcile","msg":"Syncing source object","rid":166,"trigger":"acp-shrd/EKS-Shared-Read","namespace":"acp-shrd","cleansource":{"apiVersion":"rbac.authorization.k8s.io/v1","kind":"Role","namespace":"acp-shrd","name":"EKS-Shared-Read"}} {"level":"info","ts":1663932836.8406925,"logger":"Role.reconcile","msg":"Namespace has 'ActivitiesHalted' condition; will not touch propagated object","rid":182,"trigger":"acp-shrd-srv-service-wsipprovisioningmediator/EKS-Shared-Read","haltedRoot":"acp-shrd-srv-service-wsipprovisioningmediator","suppressedAction":"no-op"} 
{"level":"info","ts":1663932837.9457324,"logger":"Role.reconcile","msg":"Syncing source object","rid":299,"trigger":"acp-shrd/EKS-Shared-Read","namespace":"acp-shrd","cleansource":{"apiVersion":"rbac.authorization.k8s.io/v1","kind":"Role","namespace":"acp-shrd","name":"EKS-Shared-Read"}} {"level":"info","ts":1663932839.0426767,"logger":"Alert.reconcile","msg":"Deleted propagated object","rid":379,"trigger":"acp-shrd-srv-service-auth/shrd-kustomization-alert-config-alertmanager-0"} {"level":"info","ts":1663932842.2426782,"logger":"Provider.reconcile","msg":"Deleted propagated object","rid":576,"trigger":"acp-shrd-srv-service-auth/flux-system-alertmanager-0"} {"level":"info","ts":1663932842.7491846,"logger":"RoleBinding.reconcile","msg":"Deleted propagated object","rid":561,"trigger":"acp-shrd-srv-service-staticcontent/EKS-Shared-Read"} {"level":"info","ts":1663932843.143244,"logger":"RoleBinding.reconcile","msg":"Deleted propagated object","rid":663,"trigger":"acp-shrd-srv-service-helloworld/EKS-Shared-Read"} {"level":"info","ts":1663932843.4476533,"logger":"Secret.reconcile","msg":"Syncing source object","rid":731,"trigger":"acp-shrd-srv-service-auth/squid-proxy-connection-string-25kfh2dm75","namespace":"acp-shrd-srv-service-auth","cleansource":{"apiVersion":"v1","kind":"Secret","namespace":"acp-shrd-srv-service-auth","name":"squid-proxy-connection-string-25kfh2dm75"}} {"level":"info","ts":1663932843.4477758,"logger":"Secret.reconcile","msg":"Syncing source object","rid":723,"trigger":"acp-shrd-srv-service-auth/default-token-rrpl5","namespace":"acp-shrd-srv-service-auth","cleansource":{"apiVersion":"v1","kind":"Secret","namespace":"acp-shrd-srv-service-auth","name":"default-token-rrpl5"}} {"level":"info","ts":1663932843.6386151,"logger":"RoleBinding.reconcile","msg":"Deleted propagated object","rid":809,"trigger":"acp-shrd-srv-service-wsipprovisioningmediator/EKS-Shared-Read"} {"level":"info","ts":1663932843.6390092,"logger":"RoleBinding.reconcile","msg":"Syncing source object","rid":812,"trigger":"acp-shrd/EKS-Shared-Read","namespace":"acp-shrd","cleansource":{"apiVersion":"rbac.authorization.k8s.io/v1","kind":"RoleBinding","namespace":"acp-shrd","name":"EKS-Shared-Read"}} {"level":"info","ts":1663932843.638133,"logger":"RoleBinding.reconcile","msg":"Deleted propagated object","rid":778,"trigger":"acp-shrd-srv-service-auth/EKS-Shared-Read"} {"level":"info","ts":1663932843.6424048,"logger":"Secret.reconcile","msg":"Syncing source object","rid":854,"trigger":"acp-shrd-srv-service-auth/acp-shrd-srv-service-auth-configuration-kh5fdthk8t","namespace":"acp-shrd-srv-service-auth","cleansource":{"apiVersion":"v1","kind":"Secret","namespace":"acp-shrd-srv-service-auth","name":"acp-shrd-srv-service-auth-configuration-kh5fdthk8t"}} {"level":"info","ts":1663932843.6424863,"logger":"Secret.reconcile","msg":"Syncing source object","rid":855,"trigger":"acp-shrd-srv-service-auth/sh.helm.release.v1.acp-shrd-srv-service-auth.v1","namespace":"acp-shrd-srv-service-auth","cleansource":{"apiVersion":"v1","kind":"Secret","namespace":"acp-shrd-srv-service-auth","name":"sh.helm.release.v1.acp-shrd-srv-service-auth.v1"}} {"level":"info","ts":1663932844.140371,"logger":"Secret.reconcile","msg":"Deleted propagated object","rid":1175,"trigger":"acp-shrd-srv-service-auth/default-token-k97k6"}

You can see the ActivitiesHalted condition is removed here:

{"level":"info","ts":1663932836.5450675,"logger":"hierarchyconfig.reconcile","msg":"ActivitiesHalted condition removed","rid":75,"ns":"acp-shrd-srv-service-auth"}

but the source object isn't synced until here:

{"level":"info","ts":1663932836.8385975,"logger":"Role.reconcile","msg":"Syncing source object","rid":166,"trigger":"acp-shrd/EKS-Shared-Read","namespace":"acp-shrd","cleansource":{"apiVersion":"rbac.authorization.k8s.io/v1","kind":"Role","namespace":"acp-shrd","name":"EKS-Shared-Read"}}

Alan01252 avatar Sep 23 '22 11:09 Alan01252

This is very tricky to solve with the current architecture and makes me concerned about using HNC in general.

Help would be appreciated if anyone has time/inclination (and I completely understand this is a volunteer open-source project, so I hope my tone is coming across okay!) :)

The problem is that the forest is eventually consistent, or not consistent at all (due to relying on the data being in the API server), meaning handling state (specifically when it comes to removals) is a complicated endeavour.

I first tried adding an annotation to all the descendant source objects and using that to determine whether HNC should be allowed to delete the resource, but this doesn't work on a restart either, because the descendant source object is also not populated on a restart. I think I could fix this, but then I'm wondering how much I'd be pulling at the wool ball if I did so.

Alan01252 avatar Sep 26 '22 09:09 Alan01252

I've had a crack at modifying the code here. This is being tested in our pre-prod environment, but I don't expect this pull request to be merged. Just hoping that it sparks some discussion :)

https://github.com/kubernetes-sigs/hierarchical-namespaces/pull/229/files

Alan01252 avatar Sep 26 '22 12:09 Alan01252

Hey @Alan01252, sorry for the delay in getting back to you. Yes, HNC can mistakenly delete objects while it's restarting; I'd like to have a go at fixing this once and for all in the future. The idea I had in mind was:

  • Every HierarchyConfiguration object stores both its parent and its last known children
  • During startup, HNC can use this information to know if its tree is incomplete, and disable all insertions or deletions until all expected namespaces are accounted for (see the sketch after this list)
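
As a minimal sketch of that gate (everything here is hypothetical - the persisted children list doesn't exist yet, and these aren't real HNC types):

```go
package main

import "fmt"

// startupGate: mutations stay disabled until every namespace recorded
// before the restart has been seen again. Hypothetical, not HNC code.
type startupGate struct {
	expected map[string]bool // last-known namespaces, from the persisted children lists
	seen     map[string]bool // namespaces synced since this restart
}

func (g *startupGate) markSeen(ns string) { g.seen[ns] = true }

// mutationsAllowed reports whether HNC may safely propagate or delete.
func (g *startupGate) mutationsAllowed() bool {
	for ns := range g.expected {
		if !g.seen[ns] {
			return false // tree still incomplete: stay read-only
		}
	}
	return true
}

func main() {
	g := &startupGate{
		expected: map[string]bool{"acp-shrd": true, "acp-shrd-srv-service-auth": true},
		seen:     map[string]bool{},
	}
	g.markSeen("acp-shrd-srv-service-auth")
	fmt.Println(g.mutationsAllowed()) // false: acp-shrd not accounted for yet
	g.markSeen("acp-shrd")
	fmt.Println(g.mutationsAllowed()) // true: safe to resume deletions
}
```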

However, the bigger problem is that HNC really shouldn't be restarting frequently. We've tried to set the default CPU and RAM limits generously, but have you tried raising them to see if that helps?

adrianludwin avatar Oct 02 '22 16:10 adrianludwin

I think a simpler fix might be to modify the forest so that we know if a source object is definitely present, definitely missing, or unknown (here). We already do that with namespaces themselves (ns.Exists()) but we should be able to do this with objects too.

So because the inherited-from label uniquely identifies the namespace that the object should have been propagated from, we could check to see if that object definitely does not exist, or if we're not sure. If we're not sure, we re-enqueue that object. Then everything should work as it does today - if it turns out the object really doesn't exist, HNC will enqueue all objects of the same type with the same name in all descendant namespaces, which will let them all be deleted. Otherwise, it'll be a no-op.
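
Something like this tristate, as a sketch (illustrative names only, not the actual forest API):

```go
package main

import "fmt"

// existence mirrors the ns.Exists() idea, but for source objects.
type existence int

const (
	unknown existence = iota // not yet synced: don't act on it
	present                  // source object confirmed in the cache
	missing                  // source object confirmed absent
)

// decide says what to do with a propagated copy whose inherited-from
// label points at a given source object.
func decide(source existence) string {
	switch source {
	case present:
		return "keep/update the propagated copy"
	case missing:
		return "delete the propagated copy (and enqueue the other descendants)"
	default:
		return "requeue: source state unknown, don't delete yet"
	}
}

func main() {
	fmt.Println(decide(unknown)) // during startup, before the lists complete
	fmt.Println(decide(missing)) // after a full sync: deletion is now safe
}
```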

adrianludwin avatar Oct 02 '22 21:10 adrianludwin

Hi Adrian!

Thanks for the reply, tis very much appreciated.

However, the bigger problem is that HNC really shouldn't be restarting frequently. We've tried to set the default CPU and RAM limits generously, but have you tried raising them to see if that helps?

We've solved the immediate stability issues (these were caused by us missing an alert on our pre-prod environment). However, the reason this causes concern is that we cannot use HNC and do EKS/node upgrades etc. without having downtime of all our services, due to propagated network policies etc. being deleted when the HNC pod moves nodes. I assume anyone running HNC on spot instances would have similar challenges.

So because the inherited-from label uniquely identifies the namespace that the object should have been propagated from, we could check to see if that object definitely does not exist, or if we're not sure. If we're not sure, we re-enqueue that object. Then everything should work as it does today - if it turns out the object really doesn't exist, HNC will enqueue all objects of the same type with the same name in all descendant namespaces, which will let them all be deleted. Otherwise, it'll be a no-op.

This makes sense, and would presumably require another API call to see if the object actually does exist. I'll be honest, I wasn't entirely sure how to implement that, so the method in the aforementioned pull request makes sure nothing is deleted unless it's first been seen in the forest.
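
For reference, the kind of extra check I mean might look like this with controller-runtime (the helper and its signature are hypothetical; only the client.Reader.Get and apierrors.IsNotFound calls are real APIs):

```go
package sketch

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// sourceReallyMissing (hypothetical helper) double-checks via a read that
// the source object named by the inherited-from label is truly gone
// before a propagated copy is deleted.
func sourceReallyMissing(ctx context.Context, c client.Reader, srcNS, name string, obj client.Object) (bool, error) {
	err := c.Get(ctx, types.NamespacedName{Namespace: srcNS, Name: name}, obj)
	if apierrors.IsNotFound(err) {
		return true, nil // confirmed gone: deleting the copy is safe
	}
	return false, err // still exists, or a transient error: don't delete
}
```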

Alan01252 avatar Oct 03 '22 06:10 Alan01252

Out of interest, what was the cause of the restarts if you don't mind me asking? I occasionally see people asking about this and I'm wondering if there's a common problem.

HNC's startup performance is quite poor because of a mistake I made in the earliest versions of HNC - I thought that each object was read by an individual call to the apiserver, so adding a few bad writes wasn't a big deal, performance-wise. I later learned that all objects of a given type are read at the same time by a single call to the apiserver, which is very performant. Oops. This problem is fixable - basically, in addition to getting rid of the object deletions, we need to stop writing back HierarchyConfiguration objects to remove child namespaces from the status simply because they haven't been synced yet. But I haven't had a chance to do this.
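
To illustrate the read pattern: with controller-runtime (which HNC is built on), all objects of a type come back from one List call, served from the shared informer cache. The helper name below is mine; the List call is the real API:

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// countSecrets: every Secret in the cluster comes back from this single
// List call, so the reads themselves are cheap; it's the extra writes
// that hurt startup.
func countSecrets(ctx context.Context, c client.Reader) (int, error) {
	var secrets corev1.SecretList
	if err := c.List(ctx, &secrets); err != nil {
		return 0, err
	}
	return len(secrets.Items), nil
}
```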

If you want to have a crack at either (or both) of these problems, I can guide you in how to do that. Basically, extend the ns.Exists() idea to objects, possibly with some kind of periodic garbage collection so that non-existent objects are periodically forgotten (sketched below). Then, stop HNC from deleting objects or updating HierarchyConfigurations based on known-incomplete data. Otherwise, I'll try to get to both by the end of the year.
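
A rough sketch of the garbage-collection part (all names hypothetical, just to show the shape of it):

```go
package sketch

import (
	"sync"
	"time"
)

// missingObjects (hypothetical) records "confirmed missing" source
// objects; entries are forgotten after maxAge so stale knowledge
// degrades back to "unknown".
type missingObjects struct {
	mu      sync.Mutex
	records map[string]time.Time // object key -> when it was marked missing
}

func (m *missingObjects) markMissing(key string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.records[key] = time.Now()
}

// gcLoop periodically drops old records; run it in a goroutine.
func (m *missingObjects) gcLoop(every, maxAge time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-stop:
			return
		case now := <-t.C:
			m.mu.Lock()
			for k, at := range m.records {
				if now.Sub(at) > maxAge {
					delete(m.records, k) // state becomes "unknown" again
				}
			}
			m.mu.Unlock()
		}
	}
}
```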

adrianludwin avatar Oct 03 '22 13:10 adrianludwin

Thanks again for your reply Adrian, this all makes sense to me now :)

Out of interest, what was the cause of the restarts if you don't mind me asking? I occasionally see people asking about this and I'm wondering if there's a common problem.

We had a node with high CPU utilisation, and for reasons I don't really understand this was causing HNC to restart more frequently than other pods. We resolved the CPU utilisation on the node and used kustomize to patch and remove the CPU limit; since then it's been solid as a rock.

If you want to have a crack at either (or both) of these problems, I can guide you in how to do that. Basically, extend the ns.Exists() idea for objects, possibly with some kind of periodic garbage collection so that non-existent objects are periodically forgotten.

I'll see what I can do; we get Friday afternoons to work on side projects at my current place of work, so I'll see if I can implement what you've stated. I'll caveat all this by stating that my full-time job isn't as a Go developer, so the code may not look like what you might have in your head!

Alan01252 avatar Oct 05 '22 06:10 Alan01252

For anyone looking for an update. I'm really sorry, but I've not been able to make this a priority to get resolved for various reasons.

Alan01252 avatar Dec 13 '22 07:12 Alan01252

@Alan01252 no worries, we all have jobs and lives so we're all in the same boat :)

adrianludwin avatar Feb 15 '23 13:02 adrianludwin

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar May 16 '23 13:05 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Jun 15 '23 13:06 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Jul 15 '23 14:07 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jul 15 '23 14:07 k8s-ci-robot