spring-cloud-alibaba

Configuring /actuator/health as the liveness probe causes abnormal pod restarts when the application's requests to Nacos fail

Open ZhXZhao opened this issue 2 years ago • 5 comments

Which Component

        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-actuator</artifactId>
            <version>2.3.12.RELEASE</version>
        </dependency>

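Describe the bug A Kubernetes application uses /actuator/health as its liveness probe. When the application's requests to Nacos fail, the pod is restarted abnormally, even though the application itself is healthy and can still accept external requests.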

To Reproduce Steps to reproduce the behavior:

  1. Start a simple Spring Cloud Alibaba application and add the Nacos config, Nacos discovery, and actuator dependencies
       <dependency>
           <groupId>com.alibaba.cloud</groupId>
           <artifactId>spring-cloud-starter-alibaba-nacos-config</artifactId>
       </dependency>
       <dependency>
           <groupId>com.alibaba.cloud</groupId>
           <artifactId>spring-cloud-starter-alibaba-nacos-discovery</artifactId>
       </dependency>
       <dependency>
           <groupId>org.springframework.boot</groupId>
           <artifactId>spring-boot-starter-actuator</artifactId>
       </dependency>
  2. Access http://localhost:8088/actuator/health
{
    "status": "UP",
    "components": {
        "discoveryComposite": {
            "status": "UP",
            "components": {
                "discoveryClient": {
                    "status": "UP",
                    "details": {
                        "services": [
                            "provider"
                        ]
                    }
                }
            }
        },
        "diskSpace": {
            "status": "UP",
            "details": {
                "total": 499963174912,
                "free": 285202673664,
                "threshold": 10485760,
                "exists": true
            }
        },
        "nacosConfig": {
            "status": "UP"
        },
        "nacosDiscovery": {
            "status": "UP"
        },
        "ping": {
            "status": "UP"
        },
        "refreshScope": {
            "status": "UP"
        }
    }
}
  3. Simulate a lost connection between the application and Nacos by adding a network whitelist policy, then access http://localhost:8088/actuator/health again
{
    "status": "DOWN",
    "components": {
        "discoveryComposite": {
            "status": "UP",
            "components": {
                "discoveryClient": {
                    "status": "UP",
                    "details": {
                        "services": []
                    }
                }
            }
        },
        "diskSpace": {
            "status": "UP",
            "details": {
                "total": 499963174912,
                "free": 284207394816,
                "threshold": 10485760,
                "exists": true
            }
        },
        "nacosConfig": {
            "status": "UP"
        },
        "nacosDiscovery": {
            "status": "DOWN"
        },
        "ping": {
            "status": "UP"
        },
        "refreshScope": {
            "status": "UP"
        }
    }
}
  4. Actuator now reports the application's health status as "DOWN", although the application can in fact still serve external requests
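
The two responses above differ only in nacosDiscovery, yet the top-level status flips to "DOWN". A minimal sketch of why, in plain Java with no Spring dependency: it assumes the default severity order used by Spring Boot's status aggregation (DOWN, OUT_OF_SERVICE, UP, UNKNOWN), in which the most severe component status wins. The class and method names are hypothetical and exist only for this illustration; this is not the actuator's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of "worst status wins" aggregation; not Spring Boot code.
final class HealthAggregationSketch {

    // Assumed severity order: entries earlier in the list are more severe.
    private static final List<String> ORDER =
            List.of("DOWN", "OUT_OF_SERVICE", "UP", "UNKNOWN");

    // Statuses outside the assumed order are ranked as least severe.
    private static int rank(String status) {
        int index = ORDER.indexOf(status);
        return index >= 0 ? index : ORDER.size();
    }

    // Returns the most severe status among all components.
    static String aggregate(Map<String, String> components) {
        String worst = "UNKNOWN";
        for (String status : components.values()) {
            if (rank(status) < rank(worst)) {
                worst = status;
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        // Component names taken from the responses in this issue.
        Map<String, String> components = new LinkedHashMap<>();
        components.put("diskSpace", "UP");
        components.put("nacosConfig", "UP");
        components.put("nacosDiscovery", "UP");
        components.put("ping", "UP");
        System.out.println(aggregate(components)); // prints UP

        // Nacos becomes unreachable: one DOWN component is enough.
        components.put("nacosDiscovery", "DOWN");
        System.out.println(aggregate(components)); // prints DOWN
    }
}
```

Under this model, any probe that reads the aggregated /actuator/health status inherits the failure of every external dependency that contributes an indicator.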

Expected behavior When the connection between the SCA application and Nacos fails, the /actuator/health status should not be "DOWN"

Additional context SCA Version 2.2.9.RELEASE

ZhXZhao avatar Dec 12 '23 07:12 ZhXZhao

In general, the "Liveness" state should not be based on external checks, such as Health checks. If it did, a failing external system (a database, a Web API, an external cache) would trigger massive restarts and cascading failures across the platform.

chickenlj avatar Jan 02 '24 09:01 chickenlj

I think there's no need to change the implementations of NacosConfigHealthIndicator and NacosDiscoveryHealthIndicator; they both work well for reporting the status of Nacos to /actuator/health.

Spring Boot has provided Liveness and Readiness probes since 2.3.x. Users should always use /actuator/health/liveness and /actuator/health/readiness for container lifecycle checks instead of /actuator/health:

  • https://docs.spring.io/spring-boot/docs/3.2.1/reference/htmlsingle/#features.spring-application.application-availability
  • https://docs.spring.io/spring-boot/docs/3.2.1/reference/htmlsingle/#actuator.endpoints.kubernetes-probes
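
To illustrate the difference between the endpoints, here is a rough sketch in plain Java (no Spring dependency) of how a health group evaluates only its own members. It assumes that, by default, the liveness group contains only livenessState and the readiness group only readinessState, and that the probes are active (automatically on Kubernetes, or with management.endpoint.health.probes.enabled=true). The class and method names are hypothetical, and the code models the idea rather than Spring Boot's implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of health groups; not Spring Boot code.
final class ProbeGroupSketch {

    // Assumed severity order: entries earlier in the list are more severe.
    private static final List<String> ORDER =
            List.of("DOWN", "OUT_OF_SERVICE", "UP", "UNKNOWN");

    private static int rank(String status) {
        int index = ORDER.indexOf(status);
        return index >= 0 ? index : ORDER.size();
    }

    // Aggregates only the components that belong to the given group.
    static String groupStatus(Map<String, String> components, Set<String> members) {
        String worst = "UNKNOWN";
        for (String name : members) {
            String status = components.getOrDefault(name, "UNKNOWN");
            if (rank(status) < rank(worst)) {
                worst = status;
            }
        }
        return worst;
    }

    public static void main(String[] args) {
        // Nacos is unreachable, but the application itself is running.
        Map<String, String> components = Map.of(
                "livenessState", "UP",
                "readinessState", "UP",
                "nacosDiscovery", "DOWN",
                "ping", "UP");

        // /actuator/health: every component counts.
        System.out.println(groupStatus(components, components.keySet())); // prints DOWN
        // /actuator/health/liveness: only livenessState counts.
        System.out.println(groupStatus(components, Set.of("livenessState"))); // prints UP
        // /actuator/health/readiness: only readinessState counts.
        System.out.println(groupStatus(components, Set.of("readinessState"))); // prints UP
    }
}
```

With the Kubernetes livenessProbe pointed at /actuator/health/liveness, a Nacos outage in this model no longer restarts the pod, while /actuator/health still reports the outage for monitoring.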

chickenlj avatar Jan 02 '24 13:01 chickenlj

Agree.

yuluo-yx avatar Jan 03 '24 03:01 yuluo-yx

We need one article on sca.aliyun.com giving best practices for deploying Spring Cloud Alibaba on Kubernetes.

chickenlj avatar Jan 04 '24 07:01 chickenlj

Thanks. Could you give some suggestions for users on Spring Boot versions lower than 2.3.x?

ZhXZhao avatar Jan 08 '24 10:01 ZhXZhao

This issue has been open 30 days with no activity. This will be closed in 7 days.

github-actions[bot] avatar Feb 23 '24 18:02 github-actions[bot]

hi, @ZhXZhao I have written a best practice here. Do you have time to review?

https://github.com/yuluo-yx/sca-k8s-demo/tree/openfeign

yuluo-yx avatar Feb 28 '24 06:02 yuluo-yx

Okay, that's great

ZhXZhao avatar Mar 26 '24 02:03 ZhXZhao