Alpha → Beta Graduation
From KEP-1645, here are the outlined requirements for MCS graduation from Alpha to Beta:
- [x] A detailed DNS spec for multi-cluster services.
- [x] NetworkPolicy either solved or explicitly ruled out.
- [x] API group chosen and approved.
- [x] E2E tests exist for MCS services.
- [x] Beta -> GA Graduation criteria defined.
- [x] At least one MCS DNS implementation.
- [x] A formal plan for a standard Cluster ID.
- [x] Finalize a name for the "supercluster" concept.
- [x] Cluster ID KEP is in beta
- [ ] https://github.com/kubernetes-sigs/mcs-api/pull/85 to collapse spec/status
Reference:
- https://github.com/kubernetes/enhancements/tree/master/keps/sig-multicluster/1645-multi-cluster-services-api#alpha---beta-graduation
This issue proposes that we track the progress of the above steps (or modify them as appropriate) and make this a canonical reference for graduation progress.
/assign @lauralorenz
Open to any thoughts about current appetite to progress towards graduation, and whether an issue like this would help that effort. Happy to help contribute in any way!
We've seen good uptake of the general concepts of ServiceExport and ServiceImport and I'd be excited to help get this API promoted, but I think there's a few implementation details that may need revisions first:
- [ ] `spec.ports`, `spec.ips` and `spec.type` likely belong under `status`, not `spec` - The source of truth for these values is the exported Service, and these fields should be written by a controller syncing the ServiceImport resource to a cluster, not a human. Manually updating these fields could cause unexpected behavior.
  - GKE has actually forked the CRD to make this change, and fixing this upstream could, I hope, allow them to de-fork.
  - Azure Fleet engineers asked about this in Kubernetes Slack while building an MCS implementation, and we would be supportive of changing this upstream.
  - The AWS CloudMap MCS blog post includes log snippets showing "ServiceImport IPs need update".
  - I've discussed this recently with @JeremyOT and @skitt and I'd be willing to write a short GEP proposing this change if necessary.
- [ ] KEP-2149 ClusterID has reached beta in Kubernetes v1.28 and feels like it qualifies as a formal plan for a standard Cluster ID, but we should clarify the potentially overlapping scope between cluster-scoped `ClusterProperty` CRDs and the `status.properties` field in the ClusterInventory API proposal. (Is one of these authoritative and causes the other to be updated? Is that relationship unidirectional or implementation-specific?)
- [ ] IIRC "supercluster" was used to refer to "a clusterset of clustersets"
  - I think ClusterInventory can solve part of this problem space as "a list of clusters, which may include metadata on whether they belong to a ClusterSet", but a strict hierarchy here is not what's needed.
  - What I believe will be needed (as future design space, not blocking advancement) is cross-ClusterSet "peering" relationships: the ability to export a service beyond a ClusterSet to be imported/consumed by a different Cluster or ClusterSet which may not have the same "sameness" guarantees. Notably, this likely means that automatically generating a ServiceImport with an appropriate name and placing it in an appropriate namespace may not be possible. Manually creating and naming a ServiceImport could become a way to handle this, using `spec` fields to map to a ClusterSet-external exported service "known" to the cluster, and likely adding a `status.conditions` field to ServiceImport for a controller to report whether this attempted mapping was successful.
- [ ] Better define desired behavior and known patterns for managing ClusterSet membership and included/excluded namespace controls.
  - The spec currently describes these mostly from a "bottom-up" perspective as managed by cluster-scoped resources within individual member clusters, but some MCS implementations manage these through a "top-down" hub cluster architecture. At minimum we should describe the risks/constraints of each approach, and possibly consider documenting how an MCS controller should report status for an unauthorized cluster attempting to join a ClusterSet, or to create a ServiceExport in a namespace which should be private to each cluster in a ClusterSet (such as the `metrics` namespace example in the sameness position statement).
  - Refs https://github.com/kubernetes/community/pull/6748
> KEP-2149 ClusterID has reached beta in Kubernetes v1.28 and feels like it qualifies as a formal plan for a standard Cluster ID, but we should clarify the potentially overlapping scope here between cluster-scoped `ClusterProperty` CRDs and the `status.properties` field in the ClusterInventory API proposal. Is one of these authoritative and causes the other to be updated? Is that relationship unidirectional or implementation-specific?
The intention of KEP-2149 was to meet this graduation criterion for the MCS API. Personally I see one causing the other to be updated, as the About API is the cluster-local/"spoke" version of this information and ClusterInventory is the management/registry/"hub" version. Some potential variations of this are discussed in the ClusterID KEP. Regardless, I perceive the direction of the relationship to be implementation-specific. If we do feel the need to clarify this scope, I think it should happen in KEP-2149: ClusterID and/or the ClusterInventory KEP and not block the graduation of the MCS API.
> IIRC "supercluster" was used to refer to "a clusterset of clustersets"
"Supercluster", in the context of the MCS API at least, is just the old name for "clusterset". It was changed to clusterset via community vote. Just a name :)
The interpretation of the word as "a clusterset of clustersets" is more relevant for ClusterInventory IMHO. The MCS API is explicitly scoped to a single clusterset.
> Better define desired behavior and known patterns for managing ClusterSet membership and included/excluded namespace controls.
I don't consider this a blocker for MCS API graduation, as the API was specifically designed to leave these mechanics out of scope. I do think they are worth discussing separately, especially in the context of ClusterInventory.
cc @mikemorris
cc @jackfrancis
Cool with this being canonical, old version was in https://docs.google.com/document/d/12znQZGyRdUWbHkKuif0ySZdm0SMIymGJFQvAxVREMpE/edit for posterity/reference.
Where that doc left off, the only things we perceived as missing were the ClusterID beta (which has now happened) and fully fleshing out the MCS API e2e tests as they were described in the original KEP. There was a LOT of chatter about the e2e tests and what constituted being "done" with them, whether the original KEP covered enough cases, etc. Naturally there was a spectrum from "just ship it" to "fully fledged conformance tests all the implementers could run". The latter is still interesting to me but ultimately not a hard blocker for MCS API beta.
Where we landed then, and what I suggest now, is to continue with the effort of finalizing the e2e tests as written in the original KEP, the last state of which is captured in https://github.com/kubernetes-sigs/mcs-api/issues/14 (as you can see, the related PRs and issues got stale over time). If we pick back up there, the scope stays very small.
@lauralorenz thanks for the background, I checked all the boxes in this issue description w/ the exception of the E2E item. Happy to take over that effort, we can discuss the scope of what "done" means in next SIG call.
Thank you!
cc @nojnhuh
/assign @nojnhuh
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
You can:
- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
IT IS HAPPENING PEOPLE
@mikemorris @ryanzhang-oss are we able to add the desired changes to spec/status as a line item here (in other words, do we have definitive, consensus requirements on what those look like)?
I believe we are done w/ E2E tests, which according to the original scope of this issue, would close it. But we do want these additional API changes, and a v1alpha2, prior to graduating to v1beta1, so we want to point that out in the issue details IMO.
That's my understanding, at least!
@jackfrancis I've updated https://github.com/kubernetes-sigs/mcs-api/pull/52#issuecomment-2327303276 with the proposed changes to collapse spec/status fields to root, planning to coordinate with @ryanzhang-oss to share with Karmada devs for sufficient consensus (we have agreement so far from involved parties from Google, Microsoft and Red Hat who have been active in SIG-Multicluster).
Yay, added a new checkbox to the issue description here, and checked off the E2E item, thank you @nojnhuh!