multi-cluster-app-dispatcher
multi-cluster-app-dispatcher copied to clipboard
Fix creation of AWs/Jobs with same name in different namespaces
Issue link
Resolves #433 And resolves #383 And closes #671
In the genericresource.go file there is a SyncQueueJob function which is used to create resources inside etcd. In this function, the identification of AppWrappers/Jobs was primarily based on a label (resourceName) which had a value of the AW name without taking into account the namespace. As a result, when 2 AWs/Jobs with the same name were created in different namespaces, the 2nd AW/Job was identified as a duplicate.
What changes have been made
- Updated labels from
appwrapper.mcad.ibm.comtoworkload.codeflare.dev/appwrapper, and fromresourceNametoworkload.codeflare.dev/resourceName. - Created helper function
GetRandomString()to truncate the name of objects in etcd at 60 characters instead of 63, to allow to append 3 random alphanumeric (lowercased) characters to avoid unexpected clashes when using similar-but-ending-differently long names. - Label selectors now take into account the AppWrapper namespace and not just the name.
- Updated quotaManagement.rbac.apiGroup from
ibm.comtoquota.codeflare.dev
Verification steps
- Have MCAD running.
- Create two AWs/Jobs with the same name but in different namespaces.
- See their pods created respectively, and running.
Checks
- [x] I've made sure the tests are passing.
- Testing Strategy
- [x] e2e test
- [x] Manual tests
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Once this PR has been reviewed and has the lgtm label, please assign asm582 for approval. For more information see the Kubernetes Code Review Process.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Do we have any plans to change the label name from appwrapper.mcad.ibm.com to something consistent with the group name, maybe appwrapper.worload.codeflare.dev?
Do we have any plans to change the label name from
appwrapper.mcad.ibm.comto something consistent with the group name, maybeappwrapper.worload.codeflare.dev?
Yes, I agree it'll have to be changed, e.g., to worload.codeflare.dev/appwrapper. @tardieu is this the only purpose of that label? If yes, @ChristianZaccaria maybe you could tackle that change in the scope of that PR, WDYT?
Do we have any plans to change the label name from
appwrapper.mcad.ibm.comto something consistent with the group name, maybeappwrapper.worload.codeflare.dev?Yes, I agree it'll have to be changed, e.g., to
worload.codeflare.dev/appwrapper. @tardieu is this the only purpose of that label? If yes, @ChristianZaccaria maybe you could tackle that change in the scope of that PR, WDYT?
@astefanutti Yes, sounds good to me. It's a good idea to include the change here.
@tardieu is this the only purpose of that label?
The label is used to select resources directly created by MCAD (in genericresource.go) during creation, deletion, and to decide if the job is complete. As discussed, this is probably redundant with just querying for the generic items one by one.
The same label is used to select pods directly or indirectly caused by an appwrapper (in queuejob_controller_ex.go) to look at pod statuses and decide if the job is healthy or not.
The change needs to be consistent across. The introduction of a separate label for the namespace should also be done consistently across. Otherwise, we will end up confusing pods resulting from appwrapper mynamespace/myname with the pods of appwrapper myothernamespace/myname.
This does mean that all tests and all appwrapper yamls in circulation have to be adjusted. I would rather sync this change with the apigroup change than have users go back and forth between 3 CRD revisions (old group + old label, new group + old label, new group + new label).
If I read this correctly, the latest push adds a label with the namespace of the wrapped resource itself rather than the appwrapper namespace. I am not sure I understand the logic.
We should probably avoid unqualified label names like resourceName and replace them with names such as workload.codeflare.dev/resourceName.
If I read this correctly, the latest push adds a label with the namespace of the wrapped resource itself rather than the appwrapper namespace. I am not sure I understand the logic.
I think I understand your point. Would it make more sense to include labels for both wrapped resources and appwrappers? or just use the appwrapper name and namespace? I think that the issue with this is that i.e., if two jobs are created with same name and different namespaces, the 2nd job's pods are not created, so it's not originally the appwrapper where the issue lies. Let me know if I misunderstand, thank you Olivier
@astefanutti Hi Antonin, I'm wondering if you think updating the quotaManagement.rbac.apiGroup like this is correct? From ibm.com to quota.codeflare.dev. Thanks!
https://github.com/project-codeflare/multi-cluster-app-dispatcher/blob/bbb6871a4d79d1364cd04504557d0f1c4c671844/test/kuttl-test-borrowing.yaml#L8
I attempted to not use completely the resourceName label but caused the e2e tests to fail. I believe it has to do with certain scenarios where the resourceName differs from the AppWrapper name, such as this example: https://github.com/project-codeflare/multi-cluster-app-dispatcher/blob/41833f2bd925d39e7391165d402e4df76025a296/test/e2e-kuttl-deployment-01/steps/03-assert.yaml#L27-L30
Hello @ChristianZaccaria , a lot of changes seems to be unrelated, or indirectly at least, to the original issue:
shouldn't these change be merged in another PR? 🤔
Hello @ChristianZaccaria , a lot of changes seems to be unrelated, or indirectly at least, to the original issue:
shouldn't these change be merged in another PR? 🤔
Hi @kpouget, it was suggested to include these changes in the scope of this PR: https://github.com/project-codeflare/multi-cluster-app-dispatcher/pull/652#issuecomment-1741082212
