kcp
bug: omitting the identity hash in APIExport / APIBinding permission claims fails silently
Describe the bug
I had an APIExport that included permissionClaims for deployments/apps. Deployments are not a built-in type and are only available in a given workspace after setting up a SyncTarget to a pcluster (which creates an APIExport named kubernetes).
$ k get apiexports kubernetes -o yaml
apiVersion: apis.kcp.dev/v1alpha1
kind: APIExport
metadata:
  annotations:
    kcp.dev/cluster: root
    workload.kcp.dev/skip-default-object-creation: "true"
  creationTimestamp: "2022-10-07T12:17:03Z"
  generation: 3
  name: kubernetes
  resourceVersion: "506"
  uid: e3af3dac-d89a-48ff-8921-5fd0826a7cd3
spec:
  identity:
    secretRef:
      name: kubernetes
      namespace: kcp-system
  latestResourceSchemas:
  - rev-503.deployments.apps
status:
  conditions:
  - lastTransitionTime: "2022-10-07T12:17:03Z"
    status: "True"
    type: IdentityValid
  - lastTransitionTime: "2022-10-07T12:17:03Z"
    status: "True"
    type: VirtualWorkspaceURLsReady
  identityHash: 5fdf7c7aaf407fd1594566869803f565bb84d22156cef5c445d2ee13ac2cfca6
  virtualWorkspaces:
  - url: https://192.168.1.189:6443/services/apiexport/root/kubernetes
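For reference, the hash can be pulled straight out of the export's status shown above, e.g. with a jsonpath query:
$ k get apiexports kubernetes -o jsonpath='{.status.identityHash}'
5fdf7c7aaf407fd1594566869803f565bb84d22156cef5c445d2ee13ac2cfca6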
The permissionClaims in my APIExport looked like the following:
permissionClaims:
- group: ""
  resource: secrets
- group: ""
  resource: namespaces
- group: ""
  resource: configmaps
- group: apps
  # (SHOULD HAVE THIS) identityHash: 5fdf7c7aaf407fd1594566869803f565bb84d22156cef5c445d2ee13ac2cfca6
  resource: deployments
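With the identity hash taken from the kubernetes export's status above, the claim on the apps group should instead have looked like:
- group: apps
  identityHash: 5fdf7c7aaf407fd1594566869803f565bb84d22156cef5c445d2ee13ac2cfca6
  resource: deployments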
Because the identity hash was not provided, Deployments did not show up in the virtual workspace for the exported types when I created an APIBinding for the export.
$ k -s https://192.168.1.189:6443/services/apiexport/root/kxp.crossplane.io/clusters/* api-resources
NAME          SHORTNAMES   APIVERSION   NAMESPACED   KIND
configmaps                 v1           true         ConfigMap
namespaces                 v1           false        Namespace
secrets                    v1           true         Secret
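With the identityHash added to the claim, I would expect deployments to show up in the same listing, roughly like the following (illustrative output, not captured from a real run):
NAME          SHORTNAMES   APIVERSION   NAMESPACED   KIND
configmaps                 v1           true         ConfigMap
deployments   deploy       apps/v1      true         Deployment
namespaces                 v1           false        Namespace
secrets                    v1           true         Secret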
However, the failure to apply the permission claim was not reflected in the export status (see above) or in the binding:
conditions:
- lastTransitionTime: "2022-10-07T12:54:01Z"
  status: "True"
  type: Ready
- lastTransitionTime: "2022-10-07T12:54:01Z"
  status: "True"
  type: APIExportValid
- lastTransitionTime: "2022-10-07T12:54:01Z"
  status: "True"
  type: BindingUpToDate
- lastTransitionTime: "2022-10-07T12:54:01Z"
  status: "True"
  type: InitialBindingCompleted
- lastTransitionTime: "2022-10-07T12:54:01Z"
  status: "True"
  type: PermissionClaimsApplied
- lastTransitionTime: "2022-10-07T12:54:01Z"
  status: "True"
  type: PermissionClaimsValid
exportPermissionClaims:
- resource: secrets
- resource: namespaces
- resource: configmaps
- group: apps
  resource: deployments
Steps To Reproduce
(these are example steps; the same could be accomplished via a permissionClaim that doesn't reference an identity hash for any type provided by an APIExport)
- Setup a SyncTarget with the kubernetes APIExport.
- Create an APIExport referencing deployments/apps (a minimal sketch follows this list).
- Create an APIBinding to a workspace that references the APIExport and accepts the permissionClaims.
- Observe that the type is not available in the virtual workspace for the APIExport.
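A minimal sketch of the export from the second step might look like the following, using the kxp.crossplane.io name that appears in the api-resources command above; the important part is the claim on apps/deployments with no identityHash:
apiVersion: apis.kcp.dev/v1alpha1
kind: APIExport
metadata:
  name: kxp.crossplane.io
spec:
  permissionClaims:
  - group: apps
    resource: deployments  # missing identityHash, so the claim is silently not served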
Expected Behaviour
I would expect the APIExport either to be rejected on creation (via admission) or to have the failure surfaced in conditions. There is a bit of nuance here around the different types of errors for the export / binding. For example:
- something like admission could be used to say "this isn't a built-in type, so you need some identity hash"
- something like conditions could be used to say "your identity hash is incorrect" (this may already be the case, need to double check)
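As an illustration of the second point, the binding's status could surface something like the condition below (the reason and message are made up to show the shape, not an existing kcp condition):
conditions:
- message: permission claim for deployments.apps needs an identityHash because it is not a built-in type
  reason: IdentityHashRequired
  status: "False"
  type: PermissionClaimsValid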
Additional Context
No response
/assign
As I was debugging this for my own situation, the export was partially working: native types like secrets and configMaps responded, but SyncTargets were giving me 404 errors. That led me down quite a rabbit hole while debugging, and to false-positive controller behaviour where part of it was working but some flow was not.
I suspect this partially working state might create quite an interesting set of bugs/outages in production.
I would almost be tempted to go for a full deny of the APIExport, or to not serve it at all in the virtualWorkspace, if one of the claims is not working. This way we could avoid situations where it failed (even with a condition set to failed) but the user would not notice, and the system can be tricky to debug.
So maybe, for now, add status for visibility, plus a discussion on whether the current behaviour of serving partially is something we would want to change?
In addition, I think we need to extend APIExports and APIBindings with CLI columns to report status in the list view:
[mjudeikis@unknown faros-hub]$ kubectl get apibinding
NAME                        AGE
access.faros.sh             3h52m
apiresource.kcp.dev-1wked   4h3m
kubernetes                  3h56m
scheduling.kcp.dev-1gowj    4h3m
tenancy.kcp.dev-59phz       4h3m
workload.kcp.dev-4gqw7      4h3m
This output is not very helpful if something is failing. I think a separate issue is in order if one doesn't exist yet.
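Something like the following would make failures visible at a glance (the columns and values here are made up purely to illustrate the idea):
[mjudeikis@unknown faros-hub]$ kubectl get apibinding
NAME                        READY   REASON                    AGE
access.faros.sh             False   PermissionClaimsInvalid   3h52m
kubernetes                  True                              3h56m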
@mjudeikis thanks for the write-up!
I would almost be tempted to go for a full deny of the APIExport, or to not serve it at all in the virtualWorkspace, if one of the claims is not working.
I'm currently thinking of the following strategy:
- identity hash is not provided for non-internal type: deny at admission
- if partial export happens for some other reason: surface in conditions
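For the first path, the rejection could surface directly at apply time, along the lines of the following (hypothetical error text and file name, just to show the shape):
$ kubectl apply -f apiexport.yaml
The APIExport "kxp.crossplane.io" is invalid: spec.permissionClaims[0].identityHash: Required value: identityHash is required for non-built-in resource deployments.apps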
I could also see not surfacing the virtualWorkspace URL if the export is not functioning properly, but we would need to account for the case where an issue is intermittent. This is somewhat related to the discussion on https://github.com/kcp-dev/kcp/issues/1183.
I could also see not surfacing the virtualWorkspace URL if the export is not functioning properly
I would almost be tempted to go for a full deny of the APIExport, or to not serve it at all in the virtualWorkspace, if one of the claims is not working. This way we could avoid situations where it failed (even with a condition set to failed) but the user would not notice, and the system can be tricky to debug.
If I'm understanding correctly, you're both talking about some logic that would look kind of like what's in #2046, but in the controller and managing a condition or the virtualWorkspace URL. Is that right?
Issues go stale after 90d of inactivity.
After a further 30 days, they will turn rotten.
Mark the issue as fresh with /remove-lifecycle stale.
If this issue is safe to close now please do so with /close.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
/close
@kcp-ci-bot: Closing this issue.
In response to this:
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten. /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.