
[RFE] Improve OCP project pruning

Open · jtudelag opened this issue 4 years ago · 1 comment

In the operator era, deleting an OCP project requires first deleting the operator-related resources and then the operator itself; otherwise, the project gets stuck in the "Terminating" phase.
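To illustrate why the order matters, here is a toy Python model (not Kubernetes code) of namespace deletion with finalizers. It assumes a simplified rule: an object marked for deletion is removed only once its finalizer list is empty, and only a still-running controller (here, the operator) clears the finalizer it owns. The finalizer name is taken from the stuck namespace shown in this issue; the classes, `delete_project`, and the "Deleted" marker are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Obj:
    """Hypothetical, simplified object: only the fields that matter for deletion."""
    name: str
    finalizers: Set[str] = field(default_factory=set)
    deleting: bool = False  # stands in for metadata.deletionTimestamp being set


class ToyNamespace:
    """Toy model of namespace deletion; not how the real namespace controller is written."""

    def __init__(self, objects: List[Obj], running_controllers: Set[str]):
        self.objects: Dict[str, Obj] = {o.name: o for o in objects}
        # Finalizers that some still-running controller knows how to clear.
        self.running_controllers = running_controllers
        self.phase = "Active"

    def delete(self) -> None:
        # Deleting the namespace marks every object inside it for deletion.
        self.phase = "Terminating"
        for o in self.objects.values():
            o.deleting = True
        self.reconcile()

    def reconcile(self) -> None:
        for o in self.objects.values():
            if o.deleting:
                # A running controller removes only the finalizers it owns.
                o.finalizers -= self.running_controllers
        # An object being deleted disappears once it has no finalizers left.
        self.objects = {
            n: o for n, o in self.objects.items() if not (o.deleting and not o.finalizers)
        }
        if self.phase == "Terminating" and not self.objects:
            self.phase = "Deleted"  # toy marker: a real namespace simply disappears


FINALIZER = "finalizer.operators.open-cluster-management.io"


def delete_project(operator_running: bool) -> str:
    hub = Obj("multiclusterhub", {FINALIZER})
    controllers = {FINALIZER} if operator_running else set()
    ns = ToyNamespace([hub], controllers)
    ns.delete()
    return ns.phase


print(delete_project(operator_running=True))   # Deleted
print(delete_project(operator_running=False))  # Terminating
```

If the operator is removed before its custom resources, nothing is left to clear the finalizer, so the namespace never leaves "Terminating".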

This is just one example of an OCP project stuck in "Terminating":

```yaml
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    openshift.io/description: ""
    openshift.io/display-name: ""
    openshift.io/requester: jtudela
    openshift.io/sa.scc.mcs: s0:c37,c29
    openshift.io/sa.scc.supplemental-groups: 1001390000/10000
    openshift.io/sa.scc.uid-range: 1001390000/10000
  creationTimestamp: "2020-05-25T11:03:59Z"
  deletionTimestamp: "2020-06-01T12:00:10Z"
  name: acm-playground-jtudela
  resourceVersion: "13574312"
  selfLink: /api/v1/namespaces/acm-playground-jtudela
  uid: 04dcd1e7-1865-4abb-85f4-3af7b350c844
spec:
  finalizers:
  - kubernetes
status:
  conditions:
  - lastTransitionTime: "2020-06-01T13:08:44Z"
    message: All resources successfully discovered
    reason: ResourcesDiscovered
    status: "False"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2020-06-01T12:00:25Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2020-06-01T12:01:00Z"
    message: All content successfully deleted, may be waiting on finalization
    reason: ContentDeleted
    status: "False"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2020-06-01T12:00:25Z"
    message: 'Some resources are remaining: helmreleases.apps.open-cluster-management.io
      has 12 resource instances, multiclusterhubs.operators.open-cluster-management.io
      has 1 resource instances, policies.policy.mcm.ibm.com has 2 resource instances'
    reason: SomeResourcesRemain
    status: "True"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2020-06-01T12:00:25Z"
    message: 'Some content in the namespace has finalizers remaining: finalizer.operators.open-cluster-management.io
      in 1 resource instances, propagator.finalizer.mcm.ibm.com in 2 resource instances,
      uninstall-helm-release in 12 resource instances'
    reason: SomeFinalizersRemain
    status: "True"
    type: NamespaceFinalizersRemaining
  phase: Terminating
```
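A pruning job could detect this state from the namespace status alone. The sketch below assumes the namespace has been fetched as JSON (for example with `oc get namespace NAME -o json`) and parsed into a dict; the condition types come from the object above, while `blocking_conditions` and `remaining_resources` are hypothetical helpers. The condition messages are human-readable text rather than a stable API, so the parsing is best-effort.

```python
import re
from typing import Dict

# Condition types reported while namespace deletion is blocked (names from the object above).
BLOCKING_TYPES = ("NamespaceContentRemaining", "NamespaceFinalizersRemaining")


def blocking_conditions(namespace: dict) -> Dict[str, str]:
    """Map condition type -> message for blocking conditions whose status is "True".

    `namespace` is assumed to be the parsed output of `oc get namespace NAME -o json`.
    """
    status = namespace.get("status", {})
    if status.get("phase") != "Terminating":
        return {}
    return {
        c["type"]: c.get("message", "")
        for c in status.get("conditions", [])
        if c.get("type") in BLOCKING_TYPES and c.get("status") == "True"
    }


def remaining_resources(message: str) -> Dict[str, int]:
    """Best-effort parse of '<resource> has <N> resource instances' pairs."""
    pattern = r"([\w.-]+)\s+has\s+(\d+)\s+resource instances"
    return {name: int(count) for name, count in re.findall(pattern, message)}


# Trimmed-down copy of the stuck namespace shown above.
stuck = {
    "metadata": {"name": "acm-playground-jtudela"},
    "status": {
        "phase": "Terminating",
        "conditions": [
            {
                "type": "NamespaceContentRemaining",
                "status": "True",
                "message": "Some resources are remaining: "
                "helmreleases.apps.open-cluster-management.io has 12 resource instances, "
                "multiclusterhubs.operators.open-cluster-management.io has 1 resource instances, "
                "policies.policy.mcm.ibm.com has 2 resource instances",
            },
            {
                "type": "NamespaceDeletionContentFailure",
                "status": "False",
                "message": "All content successfully deleted, may be waiting on finalization",
            },
        ],
    },
}

for ctype, msg in blocking_conditions(stuck).items():
    print(ctype, remaining_resources(msg))
```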

To handle this, we could enhance the logic of this script so that it removes operator-related resources before deleting the project: https://github.com/redhat-cop/openshift-management/blob/92aa7797d0d55f0c49a0b80d465c6c83fcab6f0a/images/prune-ocp-projects/include/prune-ocp-projects.sh
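One possible shape for that enhancement, sketched in Python rather than shell for readability: build the deletions in order, custom resources first and the project last. `deletion_plan` is a hypothetical helper, not part of prune-ocp-projects.sh; its input could be the resource types reported in the namespace status or a list the pruning job maintains. The commands are only built and printed here, not run.

```python
import shlex
from typing import Iterable, List


def deletion_plan(namespace: str, resource_types: Iterable[str]) -> List[List[str]]:
    """Hypothetical helper: ordered `oc` commands for pruning one project.

    Custom resources go first, while the operator that owns their finalizers is still
    running; the project itself goes last. Removing the operator (for example its OLM
    Subscription and ClusterServiceVersion) would sit between the two steps, but that
    part is operator-specific and is left out of this sketch.
    """
    plan = [
        # Deletes every instance of this type in the namespace; by default `oc delete`
        # waits for the objects to go away, i.e. for their finalizers to be cleared.
        ["oc", "delete", rtype, "--all", "-n", namespace]
        for rtype in sorted(set(resource_types))
    ]
    plan.append(["oc", "delete", "project", namespace])
    return plan


for cmd in deletion_plan(
    "acm-playground-jtudela",
    ["policies.policy.mcm.ibm.com", "helmreleases.apps.open-cluster-management.io"],
):
    print(shlex.join(cmd))
```

Building the commands as argument lists keeps the ordering logic testable on its own; a real job would still need timeouts and error handling for custom resources whose finalizers never clear.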

Upstream issue for more context: https://github.com/kubernetes/kubectl/issues/151

jtudelag · Jun 01 '20 13:06

While I generally agree this is a real problem, it is not the responsibility of this pruning operator to clean up after the failures of other operators. This scenario occurs when an operator that creates resources fails to account for namespace deletion, that is, it does not make sure its own finalizers are processed in a proper cascading order so that every finalizer gets fulfilled. Hence, this is a problem with the operator being removed, and it should be reported as an issue to that operator's team or repository. Each of these situations needs careful analysis, and it is probably not something we want to encourage people to handle with automated cleanup jobs.

briantward · Jul 09 '20 17:07