kiali icon indicating copy to clipboard operation
kiali copied to clipboard

"KIA0203 This subset's labels are not found in any matching host" For Argo Rollout canary scenario

Open Yorkerrr opened this issue 4 years ago • 12 comments

During canary rollout with Argocd with subset-traffic splitting: https://argoproj.github.io/argo-rollouts/features/traffic-management/istio/#subset-level-traffic-splitting

One of the destination is marked as: "KIA0203 This subset's labels are not found in any matching host", however its live and istio routes traffic correctly. Screen Shot 2021-07-23 at 3 03 45 PM

Istio routes traffic fine to both destinations and pods with labels from destination rule are available as well

$kubectl get pods -l app=example,rollouts-pod-template-hash=6fd44bcc96
NAME                       READY   STATUS    RESTARTS   AGE
example-6fd44bcc96-csqrf   2/2     Running   0          6h2m

 $kubectl get pods -l app=example,rollouts-pod-template-hash=64db898946
NAME                       READY   STATUS    RESTARTS   AGE
example-64db898946-hxq22   2/2     Running   0          23m

Versions used Kiali: 1.37.0 Istio: 1.8.1 Kubernetes flavour and version: AKS 1.20

To Reproduce

  1. Set up Canary Release with subset-level traffic splitting: https://argoproj.github.io/argo-rollouts/features/traffic-management/istio/#subset-level-traffic-splitting
  2. Trigger Canary rollout
  3. Check destination rule in Kiali

Expected behavior Subset should be available in Kiali.

output of: curl 'http://<kiali>/kiali/api/namespaces/example-dev/istio/destinationrules/example?validate=true' out.txt

Yorkerrr avatar Jul 23 '21 22:07 Yorkerrr

cc @xeviknal , any thoughts?

jshaughn avatar Jul 26 '21 12:07 jshaughn

thank you @Yorkerrr It is weird indeed. Because the second subset is almost the same that the first one, just changing one label. Is the second pod created the same exact way that the first one? Is it behind the same service? I think I definitely install jargo and try one example.

xeviknal avatar Jul 29 '21 10:07 xeviknal

@xeviknal , Yes, pods are created in the same way, the only thing is that Argorollout may change labels on pods on a fly and if kiali caches it, there might be some issues. It might be connected to https://github.com/kiali/kiali/issues/4141 . Looks like kiali handles replicasets quite not correctly in some cases, e.g if labels are changed after RS creation. So in case of Argorollout process, pods and RS are not immutable, argorollout updates their labels.

Yorkerrr avatar Jul 30 '21 17:07 Yorkerrr

I think this bug is originated due https://github.com/kiali/kiali/issues/4141#issuecomment-896653917, basically the custom controller is not well identified as a workload and that triggers that the validations logic is missing it on this task.

Probably when the referenced issues is resolved this one should be incorporated then.

Keeping open until this.

lucasponce avatar Aug 12 '21 08:08 lucasponce

hi @Yorkerrr,

I have been able to reproduce this scenario. Thank you again for pointing out. The problem is that validations only consider the Controllers, not the pods. In the case of Argo Rollouts, there is only one rollout (controller) for two different pods (with different labels). The validation (see the snippet bellow) only is able to receive the rollout that happens to have the labels of the stable version. That is why the first subset is valid but not the second one.

I believe that after approaching #4382 will be able to performantly use pods in this validation in order to perform accurately this scenario.

https://github.com/kiali/kiali/blob/fd31f19f427cc984bac993289287338dc43bf3b7/business/checkers/destinationrules/no_dest_checker.go#L95-L102

For documentation's sake, I've created this gist to ease the task to reproduce this scenario for non-versed argo rollouts devs: https://gist.github.com/xeviknal/9ab154d6e34c9101b54e076a16645adb

Moving this issue to waiting for external until #4382 will be merged.

xeviknal avatar Nov 05 '21 16:11 xeviknal

anything news?

jeongsm2 avatar Mar 24 '22 08:03 jeongsm2

Hi @jseongsm2 , now that #4382 is completed, we will continue on this one. Thanks.

leandroberetta avatar Mar 28 '22 12:03 leandroberetta

Hello, during the testing of this issue. I see the error, and, also, see that one of the workloads is not listed, well... we might be talking about the same workload really but, the argo rollout controler produces two replica sets, so we should have two workloads I guess, but that is not the case:

image

So, looking into the description of the validation:

image

And also looking into this comment (and especifically in the code snippet, that shows that the validation uses the workloads list).

hi @Yorkerrr,

I have been able to reproduce this scenario. Thank you again for pointing out. The problem is that validations only consider the Controllers, not the pods. In the case of Argo Rollouts, there is only one rollout (controller) for two different pods (with different labels). The validation (see the snippet bellow) only is able to receive the rollout that happens to have the labels of the stable version. That is why the first subset is valid but not the second one.

I believe that after approaching #4382 will be able to performantly use pods in this validation in order to perform accurately this scenario.

https://github.com/kiali/kiali/blob/fd31f19f427cc984bac993289287338dc43bf3b7/business/checkers/destinationrules/no_dest_checker.go#L95-L102

For documentation's sake, I've created this gist to ease the task to reproduce this scenario for non-versed argo rollouts devs: https://gist.github.com/xeviknal/9ab154d6e34c9101b54e076a16645adb

Moving this issue to waiting for external until #4382 will be merged.

I think a good thing to try is to see why there is a missing workload given that there is actually a ReplicaSet for it.

I will investigate that.

@lucasponce @hhovsepy what do you think about this?. To me is better to add this missing workload rather to modify the validation to start using the pods to validate.

leandroberetta avatar Jun 15 '22 15:06 leandroberetta

@leandroberetta agree to include missing workloads into list of considered Workloads in validation.

hhovsepy avatar Jun 15 '22 17:06 hhovsepy

@leandroberetta, @hhovsepy, I was considering lookinginto #5261. I see you have already been in this area but the last update was from June. Perhaps we can meet to talkab out the status and the feasibility of #5261?

jshaughn avatar Aug 15 '22 20:08 jshaughn

@leandroberetta, @hhovsepy, I was considering lookinginto #5261. I see you have already been in this area but the last update was from June. Perhaps we can meet to talkab out the status and the feasibility of #5261?

Hi @jshaughn yes, sure, I'm coming back next week. The last thing I was investigating is that on the cache, all controllers just keep 1 workload, even if there is more, so, Rollouts is a controller that owns two active workloads. Some logic (I guess) needs to be done to alllow more workloads if it's a Rollouts controller type.

But we can meet next week and continue discussing this.

leandroberetta avatar Aug 16 '22 11:08 leandroberetta

Stop monitoring issues :-) see you in a week!

jshaughn avatar Aug 16 '22 15:08 jshaughn