multi-cluster-app-dispatcher icon indicating copy to clipboard operation
multi-cluster-app-dispatcher copied to clipboard

Fix creation of AWs/Jobs with same name in different namespaces

Open ChristianZaccaria opened this issue 2 years ago • 12 comments

Issue link

Resolves #433 And resolves #383 And closes #671

In the genericresource.go file there is a SyncQueueJob function which is used to create resources inside etcd. In this function, the identification of AppWrappers/Jobs was primarily based on a label (resourceName) which had a value of the AW name without taking into account the namespace. As a result, when 2 AWs/Jobs with the same name were created in different namespaces, the 2nd AW/Job was identified as a duplicate.

What changes have been made

  • Updated labels from appwrapper.mcad.ibm.com to workload.codeflare.dev/appwrapper, and from resourceName to workload.codeflare.dev/resourceName.
  • Created helper function GetRandomString() to truncate the name of objects in etcd at 60 characters instead of 63, to allow to append 3 random alphanumeric (lowercased) characters to avoid unexpected clashes when using similar-but-ending-differently long names.
  • Label selectors now take into account the AppWrapper namespace and not just the name.
  • Updated quotaManagement.rbac.apiGroup from ibm.com to quota.codeflare.dev

Verification steps

  1. Have MCAD running.
  2. Create two AWs/Jobs with the same name but in different namespaces.
  3. See their pods created respectively, and running.

Checks

  • [x] I've made sure the tests are passing.
  • Testing Strategy
    • [x] e2e test
    • [x] Manual tests

ChristianZaccaria avatar Sep 29 '23 12:09 ChristianZaccaria

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign asm582 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci[bot] avatar Sep 29 '23 12:09 openshift-ci[bot]

Do we have any plans to change the label name from appwrapper.mcad.ibm.com to something consistent with the group name, maybe appwrapper.worload.codeflare.dev?

tardieu avatar Sep 29 '23 15:09 tardieu

Do we have any plans to change the label name from appwrapper.mcad.ibm.com to something consistent with the group name, maybe appwrapper.worload.codeflare.dev?

Yes, I agree it'll have to be changed, e.g., to worload.codeflare.dev/appwrapper. @tardieu is this the only purpose of that label? If yes, @ChristianZaccaria maybe you could tackle that change in the scope of that PR, WDYT?

astefanutti avatar Sep 29 '23 15:09 astefanutti

Do we have any plans to change the label name from appwrapper.mcad.ibm.com to something consistent with the group name, maybe appwrapper.worload.codeflare.dev?

Yes, I agree it'll have to be changed, e.g., to worload.codeflare.dev/appwrapper. @tardieu is this the only purpose of that label? If yes, @ChristianZaccaria maybe you could tackle that change in the scope of that PR, WDYT?

@astefanutti Yes, sounds good to me. It's a good idea to include the change here.

ChristianZaccaria avatar Sep 29 '23 15:09 ChristianZaccaria

@tardieu is this the only purpose of that label?

The label is used to select resources directly created by MCAD (in genericresource.go) during creation, deletion, and to decide if the job is complete. As discussed, this is probably redundant with just querying for the generic items one by one.

The same label is used to select pods directly or indirectly caused by an appwrapper (in queuejob_controller_ex.go) to look at pod statuses and decide if the job is healthy or not.

The change needs to be consistent across. The introduction of a separate label for the namespace should also be done consistently across. Otherwise, we will end up confusing pods resulting from appwrapper mynamespace/myname with the pods of appwrapper myothernamespace/myname.

This does mean that all tests and all appwrapper yamls in circulation have to be adjusted. I would rather sync this change with the apigroup change than have users go back and forth between 3 CRD revisions (old group + old label, new group + old label, new group + new label).

tardieu avatar Sep 29 '23 16:09 tardieu

If I read this correctly, the latest push adds a label with the namespace of the wrapped resource itself rather than the appwrapper namespace. I am not sure I understand the logic.

tardieu avatar Oct 04 '23 16:10 tardieu

We should probably avoid unqualified label names like resourceName and replace them with names such as workload.codeflare.dev/resourceName.

tardieu avatar Oct 04 '23 16:10 tardieu

If I read this correctly, the latest push adds a label with the namespace of the wrapped resource itself rather than the appwrapper namespace. I am not sure I understand the logic.

I think I understand your point. Would it make more sense to include labels for both wrapped resources and appwrappers? or just use the appwrapper name and namespace? I think that the issue with this is that i.e., if two jobs are created with same name and different namespaces, the 2nd job's pods are not created, so it's not originally the appwrapper where the issue lies. Let me know if I misunderstand, thank you Olivier

ChristianZaccaria avatar Oct 05 '23 09:10 ChristianZaccaria

@astefanutti Hi Antonin, I'm wondering if you think updating the quotaManagement.rbac.apiGroup like this is correct? From ibm.com to quota.codeflare.dev. Thanks! https://github.com/project-codeflare/multi-cluster-app-dispatcher/blob/bbb6871a4d79d1364cd04504557d0f1c4c671844/test/kuttl-test-borrowing.yaml#L8

ChristianZaccaria avatar Nov 14 '23 15:11 ChristianZaccaria

I attempted to not use completely the resourceName label but caused the e2e tests to fail. I believe it has to do with certain scenarios where the resourceName differs from the AppWrapper name, such as this example: https://github.com/project-codeflare/multi-cluster-app-dispatcher/blob/41833f2bd925d39e7391165d402e4df76025a296/test/e2e-kuttl-deployment-01/steps/03-assert.yaml#L27-L30

ChristianZaccaria avatar Nov 14 '23 15:11 ChristianZaccaria

Hello @ChristianZaccaria , a lot of changes seems to be unrelated, or indirectly at least, to the original issue: image

shouldn't these change be merged in another PR? 🤔

kpouget avatar Nov 24 '23 17:11 kpouget

Hello @ChristianZaccaria , a lot of changes seems to be unrelated, or indirectly at least, to the original issue: image

shouldn't these change be merged in another PR? 🤔

Hi @kpouget, it was suggested to include these changes in the scope of this PR: https://github.com/project-codeflare/multi-cluster-app-dispatcher/pull/652#issuecomment-1741082212

ChristianZaccaria avatar Nov 27 '23 10:11 ChristianZaccaria