ingress-nginx

Canary ingress makes the service it uses unusable for other ingresses

Zyava opened this issue 3 years ago • 23 comments

We still observe this issue with the following ingress-nginx version:

bash-5.1$ /nginx-ingress-controller --version
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v0.47.0
  Build:         7201e37633485d1f14dbe9cd7b22dd380df00a07
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.20.1

Steps to reproduce:

  1. create 2 services:

    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: webclient-prod1
      namespace: prod
    spec:
      ports:
      - name: http
        port: 80
        protocol: TCP
        targetPort: http
      selector:
        app.kubernetes.io/name: webclient-prod1
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: webclient-prod2
      namespace: prod
    spec:
      ports:
      - name: http
        port: 80
        protocol: TCP
        targetPort: http
      selector:
        app.kubernetes.io/name: webclient-prod2
    
  2. create 2 ingresses pointing to each of the services from step 1:

    ---
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: webclient-prod1
      namespace: prod
    spec:
      rules:
      - host: webclient.prod1.domain.com
        http:
          paths:
          - backend:
              service:
                name: webclient-prod1
                port:
                  number: 80
            path: /
            pathType: ImplementationSpecific
      tls:
      - hosts:
        - webclient.prod1.domain.com
        secretName: webclient-prod1-tls
    ---
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: webclient-prod2
      namespace: prod
    spec:
      rules:
      - host: webclient.prod2.domain.com
        http:
          paths:
          - backend:
              service:
                name: webclient-prod2
                port:
                  number: 80
            path: /
            pathType: ImplementationSpecific
      tls:
      - hosts:
        - webclient.prod2.domain.com
        secretName: webclient-prod2-tls
    
  3. create 2 more ingresses, which serve as the virtual IP (VIP) domain and point to each of the services from step 1: one normal and one with the nginx.ingress.kubernetes.io/canary: "true" annotation:

    ---
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: webclient-prod1-vip
      namespace: prod
    spec:
      rules:
      - host: webclient.domain.com
        http:
          paths:
          - backend:
              service:
                name: webclient-prod1
                port:
                  number: 80
            path: /
            pathType: ImplementationSpecific
      tls:
      - hosts:
        - webclient.domain.com
        secretName: webclient-prod1-vip-tls
    ---
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      annotations:
        nginx.ingress.kubernetes.io/canary: "true"
        nginx.ingress.kubernetes.io/canary-weight: "0"
      name: webclient-prod2-vip
      namespace: prod
    spec:
      rules:
      - host: webclient.domain.com
        http:
          paths:
          - backend:
              service:
                name: webclient-prod2
                port:
                  number: 80
            path: /
            pathType: ImplementationSpecific
      tls:
      - hosts:
        - webclient.domain.com
        secretName: webclient-prod2-vip-tls
    

The idea is to always expose service webclient-prod1 on its own subdomain webclient.prod1.domain.com and service webclient-prod2 on subdomain webclient.prod2.domain.com. The VIP subdomain webclient.domain.com should normally point to service webclient-prod1, but can be partially or fully switched to service webclient-prod2 in a canary manner (by changing the nginx.ingress.kubernetes.io/canary-weight annotation value).
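
For illustration, switching webclient.domain.com over is intended to be a matter of re-applying the webclient-prod2-vip ingress from step 3 with only the canary-weight value changed; the weight of 30 below is an arbitrary example:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    # 0 keeps all webclient.domain.com traffic on webclient-prod1;
    # 30 would send roughly 30% to webclient-prod2; 100 would switch it over fully
    nginx.ingress.kubernetes.io/canary-weight: "30"
  name: webclient-prod2-vip
  namespace: prod
spec:
  rules:
  - host: webclient.domain.com
    http:
      paths:
      - backend:
          service:
            name: webclient-prod2
            port:
              number: 80
        path: /
        pathType: ImplementationSpecific
  tls:
  - hosts:
    - webclient.domain.com
    secretName: webclient-prod2-vip-tls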

Expected behaviour: both webclient.prod1.domain.com and webclient.prod2.domain.com always work and are connected to their corresponding services. webclient.domain.com also works and points to one of the services (or traffic is split between them as configured by the nginx.ingress.kubernetes.io/canary-weight annotation value).

Actual behaviour: the webclient.prod2.domain.com subdomain always points to the default backend service (404); the rest works as expected.

From my understanding, this problem is caused by the bug discussed here: the upstream for the canary ingress is not created in nginx and the backend is marked with the "noServer": true flag (see below), even though another ingress uses the same service.

Output of the curl localhost:10246/configuration/backends command:

[
  {
    "name": "prod-webclient-prod1-80",
    "service": {
      "metadata": {
        "creationTimestamp": null
      },
      "spec": {
        "ports": [
          {
            "name": "http",
            "protocol": "TCP",
            "port": 80,
            "targetPort": "http"
          }
        ],
        "selector": {
          "app.kubernetes.io/name": "webclient-prod1"
        },
        "clusterIP": "172.20.81.176",
        "clusterIPs": [
          "172.20.81.176"
        ],
        "type": "ClusterIP",
        "sessionAffinity": "None"
      },
      "status": {
        "loadBalancer": {}
      }
    },
    "port": 80,
    "sslPassthrough": false,
    "endpoints": [
      {
        "address": "10.8.23.179",
        "port": "8080"
      },
      {
        "address": "10.8.3.38",
        "port": "8080"
      }
    ],
    "sessionAffinityConfig": {
      "name": "",
      "mode": "",
      "cookieSessionAffinity": {
        "name": ""
      }
    },
    "upstreamHashByConfig": {
      "upstream-hash-by-subset-size": 3
    },
    "noServer": false,
    "trafficShapingPolicy": {
      "weight": 0,
      "header": "",
      "headerValue": "",
      "headerPattern": "",
      "cookie": ""
    },
    "alternativeBackends": [
      "prod-webclient-prod2-80"
    ]
  },
  {
    "name": "prod-webclient-prod2-80",
    "service": {
      "metadata": {
        "creationTimestamp": null
      },
      "spec": {
        "ports": [
          {
            "name": "http",
            "protocol": "TCP",
            "port": 80,
            "targetPort": "http"
          }
        ],
        "selector": {
          "app.kubernetes.io/name": "webclient-prod2"
        },
        "clusterIP": "172.20.224.174",
        "clusterIPs": [
          "172.20.224.174"
        ],
        "type": "ClusterIP",
        "sessionAffinity": "None"
      },
      "status": {
        "loadBalancer": {}
      }
    },
    "port": 80,
    "sslPassthrough": false,
    "endpoints": [
      {
        "address": "10.8.29.221",
        "port": "8080"
      },
      {
        "address": "10.8.6.172",
        "port": "8080"
      }
    ],
    "sessionAffinityConfig": {
      "name": "",
      "mode": "",
      "cookieSessionAffinity": {
        "name": ""
      }
    },
    "upstreamHashByConfig": {
      "upstream-hash-by-subset-size": 3
    },
    "noServer": true,
    "trafficShapingPolicy": {
      "weight": 0,
      "header": "",
      "headerValue": "",
      "headerPattern": "",
      "cookie": ""
    }
  },
  {
    "name": "upstream-default-backend",
    "service": {
      "metadata": {
        "creationTimestamp": null
      },
      "spec": {
        "ports": [
          {
            "name": "http",
            "protocol": "TCP",
            "port": 80,
            "targetPort": "http"
          }
        ],
        "selector": {
          "app.kubernetes.io/component": "default-backend",
          "app.kubernetes.io/instance": "ingress-nginx-public",
          "app.kubernetes.io/name": "ingress-nginx"
        },
        "clusterIP": "172.20.177.223",
        "clusterIPs": [
          "172.20.177.223"
        ],
        "type": "ClusterIP",
        "sessionAffinity": "None"
      },
      "status": {
        "loadBalancer": {}
      }
    },
    "port": 0,
    "sslPassthrough": false,
    "endpoints": [
      {
        "address": "10.8.12.58",
        "port": "8080"
      },
      {
        "address": "10.8.19.96",
        "port": "8080"
      }
    ],
    "sessionAffinityConfig": {
      "name": "",
      "mode": "",
      "cookieSessionAffinity": {
        "name": ""
      }
    },
    "upstreamHashByConfig": {},
    "noServer": false,
    "trafficShapingPolicy": {
      "weight": 0,
      "header": "",
      "headerValue": "",
      "headerPattern": "",
      "cookie": ""
    }
  }
]

Zyava avatar Dec 01 '21 16:12 Zyava

One can be redirected to the default backend for more than one reason.

Please try controller version 0.50.0 and update with the following info:

  • kubectl get all,ing -A -o wide
  • kubectl -n prod describe po webclient-prod1
  • kubectl -n prod describe po webclient-prod2
  • kubectl -n prod describe svc webclient-prod1
  • kubectl -n prod describe svc webclient-prod2
  • kubectl -n prod describe ing webclient-prod1
  • kubectl -n prod describe ing webclient-prod2
  • kubectl -n prod describe ing webclient-prod1-vip
  • kubectl -n prod describe ing webclient-prod2-vip
  • kubectl -n ingresscontrollernamespace logs ingresscontrollerpodname
  • Your complete and exact curl command as executed during test and the complete response to curl in verbose mode

longwuyuan avatar Dec 02 '21 08:12 longwuyuan

@longwuyuan I'm sorry, but your request is not relevant. I provided the output of the curl localhost:10246/configuration/backends command, which clearly shows that no upstream is created for the webclient-prod2 service ("noServer": true).

Zyava avatar Dec 02 '21 08:12 Zyava

This issue https://github.com/kubernetes/ingress-nginx/issues/4667 implies that the backend will be common to the 2 ingress objects, but that, because the ingress.spec.rules.hosts fields differ, the upstreams configured in nginx.conf should be different. I will be happy to be proved wrong, but I think the value of the field ingress.spec.rules.http.paths.backend.service is what is used to configure the upstream. So it's not clear which field(s) would provide 2 different values to configure 2 different upstreams.

I hope some expert comments and makes progress for you.

On a different note, is this a real, live use case?

longwuyuan avatar Dec 02 '21 11:12 longwuyuan

Yes, this is the setup we wanted to use for our PROD environment, but unfortunately we couldn't, since a canary ingress makes the service it points to unusable for any other ingress. We eventually worked around it by creating 2 additional services, which means more work to keep all the services in sync.
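
A minimal sketch of that workaround, with hypothetical names (the actual manifests are not part of this issue): a duplicate Service selecting the same pods is created purely for the canary VIP ingress, so the Service referenced by webclient-prod2's own ingress is never flagged as a canary backend.

apiVersion: v1
kind: Service
metadata:
  # hypothetical duplicate, referenced only by the webclient-prod2-vip canary ingress
  name: webclient-prod2-vip-svc
  namespace: prod
spec:
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: http
  selector:
    # selects the same pods as the original webclient-prod2 Service
    app.kubernetes.io/name: webclient-prod2

The canary ingress then references webclient-prod2-vip-svc instead of webclient-prod2, at the cost of keeping the duplicated Services in sync by hand.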

Zyava avatar Dec 02 '21 12:12 Zyava

Can you point me to some documentation related to this concept of "the VIP subdomain"? This is the first time I have seen "VIP" used with a Kubernetes object of kind Service. It's of course a popular use case in non-Kubernetes infrastructures, so I want to read about it.

longwuyuan avatar Dec 02 '21 12:12 longwuyuan

I believe that 'trafficShapingPolicy' in the backend info from curl localhost:10246/configuration/backends is what drives the canary behavior. But the same backend is shared between all ingress objects that point to one service, as its naming convention shows (namespace-service-port, e.g. prod-webclient-prod2-80), and backends are not tied to which host and path you use.

I've run into this problem in our customers' production clusters several times, and it seems to be a frequently encountered one.

Lyt99 avatar Dec 02 '21 12:12 Lyt99

I think you should add the nginx.conf from your controller pod here, just for clarity.

longwuyuan avatar Dec 02 '21 12:12 longwuyuan

@longwuyuan VIP is just a name for the (sub)domain; we could call it "main" instead. The idea is to switch the (sub)domain from one service to another in a blue/green manner (similar to how, in the past, people used to move an IP from one VM to another).

Please ignore the naming (it might not be the best) and concentrate on the bug instead...

Zyava avatar Dec 02 '21 12:12 Zyava

Let me give another example:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: base
spec:
  rules:
  - host: test.example.com
    http:
      paths:
      - pathType: Prefix
        path: /
        backend:
          service:
            name: svc-prod
            port: 
              number: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: base1
spec:
  rules:
  - host: test1.example.com
    http:
      paths:
      - pathType: Prefix
        path: /
        backend:
          service:
            name: svc-prod
            port: 
              number: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-by-header: "X-Canary"
    nginx.ingress.kubernetes.io/canary-by-header-value: "canary"
spec:
  rules:
  - host: test.example.com
    http:
      paths:
      - pathType: Prefix
        path: /
        backend:
          service:
            name: svc-canary
            port: 
              number: 80

We use the svc-prod service in the base and base1 ingresses for test.example.com and test1.example.com, and then create a canary ingress for test.example.com pointing to svc-canary.

We expected to reach svc-prod from test.example.com without the canary header (X-Canary: canary) and from test1.example.com with any headers. But in fact, when you access test1.example.com with the canary header, the traffic is routed to svc-canary, which is not what we expected.

The situation above seems somewhat different from mine, though, and I'm not sure if they're related now :(

Lyt99 avatar Dec 02 '21 12:12 Lyt99

Just trying to understand what problem you want to solve. It looks like you want blue/green and canary working together.

Thanks, Long Wu Yuan

longwuyuan avatar Dec 02 '21 13:12 longwuyuan

Not really. Basically, we want both services, webclient-prod1 and webclient-prod2, to always be exposed on their own subdomains (ingresses), webclient.prod1.domain.com and webclient.prod2.domain.com, and to have another main/VIP/you-name-it subdomain (ingress), webclient.domain.com, which we point to either the webclient-prod1 service or the webclient-prod2 service (or both, with a canary traffic split) as required by business needs.

Unfortunately, because of the bug I described above, the subdomain (ingress) webclient.prod2.domain.com always points to the default backend service instead of the webclient-prod2 service, because no upstream is created for the webclient-prod2 service (it is marked with the "noServer": true flag). I hope it makes more sense now.

Zyava avatar Dec 02 '21 13:12 Zyava

I think I know what's going on with @Zyava's problem. I dug into the code and found that NoServer means "skip generating a server block in nginx.conf for this upstream". But it actually behaves as "skip generating a location block for this upstream", at https://github.com/kubernetes/ingress-nginx/blob/c0814c6f784e63f08768a935234afc201cf5a5f2/internal/ingress/controller/controller.go#L648-L650. That is reasonable, because a canary service should not get its own rule in the nginx configuration (it would conflict with the main upstream); but when another ingress object uses that canary service as its main service, no rule is generated for it in nginx.conf either.

And it's clear that this is not the same problem as mine, but they share a common issue: canary behavior should be associated with upstream + host + path, not only with the upstream.

Lyt99 avatar Dec 02 '21 13:12 Lyt99

Wondering if you are requesting a new feature

longwuyuan avatar Dec 03 '21 06:12 longwuyuan

@longwuyuan I wonder if it's by design that the canary takes effect at upstream scope. This means the canary rule is applied to all ingress objects that use the same service (which may not meet expectations), and in the meantime a canary service can't be used by another ingress as a primary upstream, because the controller will not generate rules for it in nginx.conf.

We've encountered this problem several times in real production environments. In my opinion, it would also be fine to implement this as a new feature.

There's a great solution to the problem in #4716, and I wonder if we can reopen it or create a new PR based on that work.

Lyt99 avatar Dec 03 '21 06:12 Lyt99

I don't understand how this can be a new feature and not a bug. A service used in a canary ingress can't be used in any other ingress: how can that be considered normal? Obviously, this "feature" (the implications of canary ingresses) is currently not documented anywhere...

Zyava avatar Dec 03 '21 07:12 Zyava

We can re-apply the bug label when the triaging is completed. Basic canary functionality seems to be working. @theunrealgeek, any comments on this?

longwuyuan avatar Dec 03 '21 08:12 longwuyuan

/assign

theunrealgeek avatar Dec 13 '21 06:12 theunrealgeek

/triage accepted
/priority backlog

strongjz avatar Jan 04 '22 17:01 strongjz

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Apr 04 '22 18:04 k8s-triage-robot

/remove-lifecycle stale

theunrealgeek avatar Apr 08 '22 21:04 theunrealgeek

I think the canary functionality is incomplete without the requested update. A lot of the time it is desirable to route specifically to the canary service (webclient-prod2) so that one can test the canary service, run pre-prod validations, etc. The canary weight at this point would be '0'. Once initial testing is complete, we can start using the canary features by increasing the canary weight. Even once a traffic shift is enabled, we would still like to continue testing against the canary.
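
A hedged sketch of that workflow using the existing canary-by-header annotation (the ingress name is illustrative, and whether this fits a given setup depends on the routing issues discussed above): testers opt in by sending "X-Canary: always" while the weight stays at 0, and the weight is raised later for the real traffic shift.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webclient-canary           # illustrative name
  namespace: prod
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "0"            # no weighted traffic yet
    nginx.ingress.kubernetes.io/canary-by-header: "X-Canary"  # "X-Canary: always" routes a request to the canary
spec:
  rules:
  - host: webclient.domain.com
    http:
      paths:
      - path: /
        pathType: ImplementationSpecific
        backend:
          service:
            name: webclient-prod2
            port:
              number: 80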

vdsharma avatar May 09 '22 19:05 vdsharma

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Aug 07 '22 19:08 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Sep 06 '22 19:09 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Oct 06 '22 20:10 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Oct 06 '22 20:10 k8s-ci-robot

@Zyava hello! Did you find any workarounds for this?

YannickZ avatar Nov 14 '23 14:11 YannickZ

Nope, unfortunately I don't know how to work around this without switching to another ingress controller…

Best regards,
Dmytro


Zyava avatar Nov 15 '23 07:11 Zyava