Add components healthcheck endpoint

Open DeepanshuA opened this issue 3 years ago • 5 comments

Description

This PR adds the http://localhost:3500/v1.0/healthz/components/<component_name> and http://localhost:3500/v1.0/healthz/components endpoints for checking the health of any registered component. The same checks are also available over gRPC.
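As a rough illustration of how a caller might use these endpoints, here is a minimal Go sketch. The helper names componentHealthURL and probe are hypothetical and not part of the PR, and the test server in main is only a stand-in so the sketch runs without a Dapr sidecar; the path is the one given in the description above.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// componentHealthURL builds the health URL from the description.
// An empty name yields the "all components" URL. (Hypothetical helper.)
func componentHealthURL(base, name string) string {
	u := base + "/v1.0/healthz/components"
	if name != "" {
		u += "/" + url.PathEscape(name)
	}
	return u
}

// probe issues a GET and returns only the HTTP status code.
func probe(client *http.Client, target string) (int, error) {
	resp, err := client.Get(target)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	// Stand-in for the sidecar; the reply here is an assumption, not Dapr's behaviour.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNoContent)
	}))
	defer srv.Close()

	code, err := probe(srv.Client(), componentHealthURL(srv.URL, "txnstore"))
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("status:", code)
}
```

Against a real sidecar, the base would be http://localhost:3500 as in the description.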

Issue reference

Please reference the issue this PR will close: #2167

Checklist

Please make sure you've completed the relevant tasks for this PR, out of the following list:

  • [x] Code compiles correctly
  • [x] Created/updated tests
  • [x] Unit tests passing
  • [x] End-to-end tests passing
  • [ ] Extended the documentation / Created issue in the https://github.com/dapr/docs/ repo: dapr/docs#[issue number]
  • [ ] Specification has been updated / Created issue in the https://github.com/dapr/docs/ repo: dapr/docs#[issue number]
  • [ ] Provided sample for the feature / Created issue in the https://github.com/dapr/docs/ repo: dapr/docs#[issue number]

DeepanshuA avatar Jun 13 '22 07:06 DeepanshuA

Thanks for your contribution, please fix CI and remove WIP when ready.

daixiang0 avatar Jun 13 '22 08:06 daixiang0

Codecov Report

Merging #4758 (6c53d1b) into master (ccce9e4) will decrease coverage by 0.06%. The diff coverage is 57.76%.

@@            Coverage Diff             @@
##           master    #4758      +/-   ##
==========================================
- Coverage   65.33%   65.28%   -0.06%     
==========================================
  Files         151      151              
  Lines       15732    15893     +161     
==========================================
+ Hits        10279    10376      +97     
- Misses       4734     4793      +59     
- Partials      719      724       +5     
Impacted Files Coverage Δ
pkg/grpc/endpoints.go 100.00% <ø> (ø)
utils/utils.go 59.61% <ø> (ø)
pkg/grpc/api.go 69.49% <52.70%> (-1.05%) :arrow_down:
pkg/http/api.go 71.06% <57.14%> (-0.62%) :arrow_down:
pkg/http/responses.go 92.30% <100.00%> (+2.30%) :arrow_up:
pkg/runtime/runtime.go 67.36% <100.00%> (+0.07%) :arrow_up:

... and 1 file with indirect coverage changes

codecov[bot] avatar Jun 14 '22 14:06 codecov[bot]

As per the community call, I understand that two updates to the health API are needed:

  1. Remove the type from error/errorCode/message.
  2. Pass more information in the error over HTTP, as gRPC already does.

Please confirm whether any other change is required as well.

I am sharing the results for the various cases here, so that any other required changes can be addressed at the same time:

GET http://localhost:3500/v1.0-alpha1/healthz/components

{
    "results": [
        {
            "componentName": "orderpubsub",
            "type": "pubsub",
            "status": "Undefined",
            "error": "ERR_PING_NOT_IMPLEMENTED_BY_pubsub"
        },
        {
            "componentName": "productpubsub",
            "type": "pubsub",
            "status": "Ok"
        },
        {
            "componentName": "smsbinding",
            "type": "bindings",
            "status": "Not_Ok",
            "error": "ERR_bindings_HEALTH_NOT_OK"
        },
        {
            "componentName": "txnstore",
            "type": "state",
            "status": "Not_Ok",
            "error": "ERR_state_HEALTH_NOT_OK"
        }
    ]
}

GET http://localhost:3500/v1.0-alpha1/healthz/components/orderpubsub 501 Not Implemented

{
    "errorCode": "ERR_PING_NOT_IMPLEMENTED_BY_orderpubsub",
    "message": "Ping is not imeplemented by orderpubsub"
}

GET http://localhost:3500/v1.0-alpha1/healthz/components/orderpubsub1 400 Bad Request

{
    "errorCode": "ERR_COMPONENT_WITH_NAME_orderpubsub1_NOT_FOUND",
    "message": "Component With Name orderpubsub1 is not found"
}

GET http://localhost:3500/v1.0-alpha1/healthz/components/productpubsub 204 No Content

GET http://localhost:3500/v1.0-alpha1/healthz/components/txnstore 500 Internal Server Error

{
    "errorCode": "ERR_STATE_HEALTH_NOT_OK",
    "message": "txnstore is not ok"
}

gRPC:

CheckAllComponentsHealthAlpha1
Input message: {}
Response:

{
    "results": [
        {
            "componentName": "orderpubsub",
            "type": "pubsub",
            "status": "undefined",
            "error": "rpc error: code = Unimplemented desc = Ping is not imeplemented by orderpubsub"
        },
        {
            "componentName": "productpubsub",
            "type": "pubsub",
            "status": "ok"
        },
        {
            "componentName": "smsbinding",
            "type": "bindings",
            "status": "not_ok",
            "error": "redis binding: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
        },
        {
            "componentName": "txnstore",
            "type": "state",
            "status": "not_ok",
            "error": "redis store: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
        }
    ]
}

CheckHealthAlpha1 Input message: { "component_name": "orderpubsub" } Response: 12 UNIMPLEMENTED

CheckHealthAlpha1 Input message: { "component_name": "orderpubsub1" } Response: 3 INVALID_ARGUMENT

CheckHealthAlpha1 Input message: { "component_name": "productpubsub" } Response: 0 OK

CheckHealthAlpha1 Input message: { "component_name": "txnstore" } Response: 2 UNKNOWN

@yaron2 @artursouza

DeepanshuA avatar Jul 26 '22 18:07 DeepanshuA

"error": "ERR_PING_NOT_IMPLEMENTED_BY_pubsub"

LGTM.

For error codes and status, use ALL CAPS and keep error codes short, like ERR_COMPONENT_NOT_FOUND.

artursouza avatar Aug 24 '22 03:08 artursouza

Please review. As discussed, I have changed the responses; the current ones are as follows:

============================================

For HTTP:

Enquire for all components:

When a few components don't implement Ping and the others are fine:

GET http://localhost:3500/v1.0-alpha1/healthz/components

200 OK

{
    "results": [
        {
            "componentName": "kafkaComp",
            "type": "pubsub",
            "status": "UNDEFINED",
            "errorCode": "ERR_PING_NOT_IMPLEMENTED"
        },
        {
            "componentName": "orderpubsub",
            "type": "pubsub",
            "status": "UNDEFINED",
            "errorCode": "ERR_PING_NOT_IMPLEMENTED"
        },
        {
            "componentName": "productpubsub",
            "type": "pubsub",
            "status": "OK"
        },
        {
            "componentName": "smsbinding",
            "type": "bindings",
            "status": "OK"
        },
        {
            "componentName": "txnstore",
            "type": "state",
            "status": "OK"
        }
    ]
}

Enquire for all components:

When a few components don't implement Ping and the others are NOT fine:

GET http://localhost:3500/v1.0-alpha1/healthz/components

{
    "results": [
        {
            "componentName": "kafkaComp",
            "type": "pubsub",
            "status": "UNDEFINED",
            "errorCode": "ERR_PING_NOT_IMPLEMENTED"
        },
        {
            "componentName": "orderpubsub",
            "type": "pubsub",
            "status": "UNDEFINED",
            "errorCode": "ERR_PING_NOT_IMPLEMENTED"
        },
        {
            "componentName": "productpubsub",
            "type": "pubsub",
            "status": "NOT OK",
            "errorCode": "ERR_HEALTH_NOT_OK",
            "message": "redis pubsub: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
        },
        {
            "componentName": "smsbinding",
            "type": "bindings",
            "status": "NOT OK",
            "errorCode": "ERR_HEALTH_NOT_OK",
            "message": "redis binding: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
        },
        {
            "componentName": "txnstore",
            "type": "state",
            "status": "NOT OK",
            "errorCode": "ERR_HEALTH_NOT_OK",
            "message": "redis store: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
        }
    ]
}
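For readers consuming this payload, the following Go sketch shows one way to decode it and pick out the failing components. The struct and function names (componentHealth, healthResponse, failing) are illustrative assumptions; only the JSON field names and status strings come from the responses above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// componentHealth mirrors one entry of "results" as shown in the responses above.
type componentHealth struct {
	ComponentName string `json:"componentName"`
	Type          string `json:"type"`
	Status        string `json:"status"`
	ErrorCode     string `json:"errorCode,omitempty"`
	Message       string `json:"message,omitempty"`
}

// healthResponse is the top-level object returned for all components.
type healthResponse struct {
	Results []componentHealth `json:"results"`
}

// failing returns the names of components whose status is "NOT OK".
// Components reporting "UNDEFINED" are deliberately not counted as failures here.
func failing(body []byte) ([]string, error) {
	var resp healthResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	var names []string
	for _, c := range resp.Results {
		if c.Status == "NOT OK" {
			names = append(names, c.ComponentName)
		}
	}
	return names, nil
}

func main() {
	// Shortened sample in the shape of the responses above.
	body := []byte(`{"results":[
		{"componentName":"kafkaComp","type":"pubsub","status":"UNDEFINED","errorCode":"ERR_PING_NOT_IMPLEMENTED"},
		{"componentName":"txnstore","type":"state","status":"NOT OK","errorCode":"ERR_HEALTH_NOT_OK","message":"connection refused"}
	]}`)
	names, err := failing(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println("failing components:", names)
}
```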

Enquire for one component:

GET http://localhost:3500/v1.0-alpha1/healthz/components/orderpubsub 405 Method Not Allowed

{
    "status": "UNDEFINED",
    "errorCode": "ERR_PING_NOT_IMPLEMENTED"
}

GET http://localhost:3500/v1.0-alpha1/healthz/components/productpubsub 500 Internal Server Error

{
    "status": "NOT OK",
    "errorCode": "ERR_HEALTH_NOT_OK",
    "message": "redis pubsub: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
}

GET http://localhost:3500/v1.0-alpha1/healthz/components/wrongComp 400 Bad Request

{
    "status": "UNDEFINED",
    "errorCode": "ERR_COMPONENT_NOT_FOUND"
}

GET http://localhost:3500/v1.0-alpha1/healthz/components/txnstore 200 OK

{
    "status": "OK"
}

gRPC:

Enquire for all components:

When a few components don't implement Ping and the others are fine:

GetAllComponentsHealthAlpha1 0:OK

{
    "results": [
        {
            "componentName": "kafkaComp",
            "type": "pubsub",
            "status": "UNDEFINED",
            "errorCode": "ERR_PING_NOT_IMPLEMENTED"
        },
        {
            "componentName": "orderpubsub",
            "type": "pubsub",
            "status": "UNDEFINED",
            "errorCode": "ERR_PING_NOT_IMPLEMENTED"
        },
        {
            "componentName": "productpubsub",
            "type": "pubsub",
            "status": "OK"
        },
        {
            "componentName": "smsbinding",
            "type": "bindings",
            "status": "OK"
        },
        {
            "componentName": "txnstore",
            "type": "state",
            "status": "OK"
        }
    ]
}

Enquire for all components:

When a few components don't implement Ping and the others are NOT fine:

GetAllComponentsHealthAlpha1 0:OK

{
    "results": [
        {
            "componentName": "kafkaComp",
            "type": "pubsub",
            "status": "UNDEFINED",
            "errorCode": "ERR_PING_NOT_IMPLEMENTED"
        },
        {
            "componentName": "orderpubsub",
            "type": "pubsub",
            "status": "UNDEFINED",
            "errorCode": "ERR_PING_NOT_IMPLEMENTED"
        },
        {
            "componentName": "productpubsub",
            "type": "pubsub",
            "status": "NOT OK",
            "errorCode": "ERR_HEALTH_NOT_OK",
            "message": "redis pubsub: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
        },
        {
            "componentName": "smsbinding",
            "type": "bindings",
            "status": "NOT OK",
            "errorCode": "ERR_HEALTH_NOT_OK",
            "message": "redis binding: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
        },
        {
            "componentName": "txnstore",
            "type": "state",
            "status": "NOT OK",
            "errorCode": "ERR_HEALTH_NOT_OK",
            "message": "redis store: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
        }
    ]
}

Enquire for one component:

GetComponentHealthAlpha1
{
    "component_name": "orderpubsub"
}
Response: 12 UNIMPLEMENTED
(ERR_PING_NOT_IMPLEMENTED)

GetComponentHealthAlpha1
{
    "component_name": "productpubsub"
}
Response: 2 UNKNOWN
(ERR_HEALTH_NOT_OK)

GetComponentHealthAlpha1
{
    "component_name": "wrongComp"
}
Response: 3 INVALID_ARGUMENT
(ERR_COMPONENT_NOT_FOUND)

GetComponentHealthAlpha1
{
    "component_name": "txnstore"
}
Response: 0 OK
{
    "status": "OK"
}
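
The single-component results in this comment amount to one lookup from error code to an HTTP status and a gRPC code. The Go sketch below writes that table out; the type and function names are hypothetical, and the gRPC codes are plain integers copied from the responses above rather than taken from a gRPC library.

```go
package main

import (
	"fmt"
	"net/http"
)

// gRPC status code numbers as they appear in the responses above.
const (
	grpcOK              = 0
	grpcUnknown         = 2
	grpcInvalidArgument = 3
	grpcUnimplemented   = 12
)

// outcome pairs the HTTP status with the gRPC code for one error code.
type outcome struct {
	HTTP int
	GRPC int
}

// outcomes is keyed by errorCode; the empty key is the healthy case.
var outcomes = map[string]outcome{
	"":                         {http.StatusOK, grpcOK},
	"ERR_PING_NOT_IMPLEMENTED": {http.StatusMethodNotAllowed, grpcUnimplemented},
	"ERR_HEALTH_NOT_OK":        {http.StatusInternalServerError, grpcUnknown},
	"ERR_COMPONENT_NOT_FOUND":  {http.StatusBadRequest, grpcInvalidArgument},
}

// lookup returns the outcome for an error code and whether the code is known.
func lookup(errorCode string) (outcome, bool) {
	o, ok := outcomes[errorCode]
	return o, ok
}

func main() {
	for _, code := range []string{"", "ERR_PING_NOT_IMPLEMENTED", "ERR_HEALTH_NOT_OK", "ERR_COMPONENT_NOT_FOUND"} {
		o, _ := lookup(code)
		fmt.Printf("%q -> HTTP %d, gRPC %d\n", code, o.HTTP, o.GRPC)
	}
}
```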

DeepanshuA avatar Sep 20 '22 22:09 DeepanshuA

This pull request has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

dapr-bot avatar Dec 04 '22 01:12 dapr-bot

/keep-alive

mukundansundar avatar Dec 05 '22 04:12 mukundansundar

@yaron2 @mukundansundar Please re-review.

DeepanshuA avatar Dec 26 '22 11:12 DeepanshuA

Please review. As per the comment above, I have changed the request; the current requests and responses are as follows:

============================================

For HTTP:

Enquire for all components:

When a few components don't implement Ping and the others are fine:

GET http://localhost:3500/v1.0-alpha1/healthz/component

200 OK

{
    "result": [
        {
            "componentName": "kafkaComp",
            "type": "pubsub",
            "status": "UNDEFINED",
            "errorCode": "ERR_PING_NOT_IMPLEMENTED"
        },
        {
            "componentName": "orderpubsub",
            "type": "pubsub",
            "status": "UNDEFINED",
            "errorCode": "ERR_PING_NOT_IMPLEMENTED"
        },
        {
            "componentName": "productpubsub",
            "type": "pubsub",
            "status": "OK"
        },
        {
            "componentName": "smsbinding",
            "type": "bindings",
            "status": "OK"
        },
        {
            "componentName": "txnstore",
            "type": "state",
            "status": "OK"
        }
    ]
}

Enquire for all components:

When a few components don't implement Ping and the others are NOT fine:

GET http://localhost:3500/v1.0-alpha1/healthz/component

{
    "result": [
        {
            "componentName": "kafkaComp",
            "type": "pubsub",
            "status": "UNDEFINED",
            "errorCode": "ERR_PING_NOT_IMPLEMENTED"
        },
        {
            "componentName": "orderpubsub",
            "type": "pubsub",
            "status": "UNDEFINED",
            "errorCode": "ERR_PING_NOT_IMPLEMENTED"
        },
        {
            "componentName": "productpubsub",
            "type": "pubsub",
            "status": "NOT OK",
            "errorCode": "ERR_HEALTH_NOT_OK",
            "message": "redis pubsub: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
        },
        {
            "componentName": "smsbinding",
            "type": "bindings",
            "status": "NOT OK",
            "errorCode": "ERR_HEALTH_NOT_OK",
            "message": "redis binding: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
        },
        {
            "componentName": "txnstore",
            "type": "state",
            "status": "NOT OK",
            "errorCode": "ERR_HEALTH_NOT_OK",
            "message": "redis store: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
        }
    ]
}

Enquire for one component:

GET http://localhost:3500/v1.0-alpha1/healthz/component?componentName=orderpubsub 405 Method Not Allowed

{
    "status": "UNDEFINED",
    "errorCode": "ERR_PING_NOT_IMPLEMENTED"
}

GET http://localhost:3500/v1.0-alpha1/healthz/component?componentName=productpubsub 500 Internal Server Error

{
    "status": "NOT OK",
    "errorCode": "ERR_HEALTH_NOT_OK",
    "message": "redis pubsub: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
}

GET http://localhost:3500/v1.0-alpha1/healthz/component?componentName=wrongComp 400 Bad Request

{
    "status": "UNDEFINED",
    "errorCode": "ERR_COMPONENT_NOT_FOUND"
}

GET http://localhost:3500/v1.0-alpha1/healthz/component?componentName=txnstore 200 OK

{
    "status": "OK"
}
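
With the component name now passed as a query parameter, a client would build the request URL roughly as in the Go sketch below. The function name componentHealthURL is a hypothetical helper; the path and the componentName parameter are taken from the requests above.

```go
package main

import (
	"fmt"
	"net/url"
)

// componentHealthURL builds the query-parameter form of the health URL.
// An empty name queries all components. (Hypothetical helper.)
func componentHealthURL(base, name string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = "/v1.0-alpha1/healthz/component"
	if name != "" {
		q := url.Values{}
		q.Set("componentName", name) // Encode escapes the value safely
		u.RawQuery = q.Encode()
	}
	return u.String(), nil
}

func main() {
	for _, name := range []string{"", "txnstore"} {
		target, err := componentHealthURL("http://localhost:3500", name)
		if err != nil {
			fmt.Println("bad base URL:", err)
			return
		}
		fmt.Println(target)
	}
}
```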

gRPC:

Enquire for all components:

When a few components don't implement Ping and the others are fine:

GetComponentHealthAlpha1 0:OK

{
    "result": [
        {
            "component_name": "kafkaComp",
            "type": "pubsub",
            "status": "UNDEFINED",
            "error_code": "ERR_PING_NOT_IMPLEMENTED"
        },
        {
            "component_name": "orderpubsub",
            "type": "pubsub",
            "status": "UNDEFINED",
            "error_code": "ERR_PING_NOT_IMPLEMENTED"
        },
        {
            "component_name": "productpubsub",
            "type": "pubsub",
            "status": "OK"
        },
        {
            "component_name": "smsbinding",
            "type": "bindings",
            "status": "OK"
        },
        {
            "component_name": "txnstore",
            "type": "state",
            "status": "OK"
        }
    ]
}

Enquire for all components:

When a few components don't implement Ping and the others are NOT fine:

GetComponentHealthAlpha1 0:OK

{
    "result": [
        {
            "component_name": "kafkaComp",
            "type": "pubsub",
            "status": "UNDEFINED",
            "error_code": "ERR_PING_NOT_IMPLEMENTED"
        },
        {
            "component_name": "orderpubsub",
            "type": "pubsub",
            "status": "UNDEFINED",
            "error_code": "ERR_PING_NOT_IMPLEMENTED"
        },
        {
            "component_name": "productpubsub",
            "type": "pubsub",
            "status": "NOT OK",
            "error_code": "ERR_HEALTH_NOT_OK",
            "message": "redis pubsub: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
        },
        {
            "component_name": "smsbinding",
            "type": "bindings",
            "status": "NOT OK",
            "error_code": "ERR_HEALTH_NOT_OK",
            "message": "redis binding: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
        },
        {
            "component_name": "txnstore",
            "type": "state",
            "status": "NOT OK",
            "error_code": "ERR_HEALTH_NOT_OK",
            "message": "redis store: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
        }
    ]
}

Enquire for one component:

GetComponentHealthAlpha1
{
    "component_name": "orderpubsub"
}
Response: 12 UNIMPLEMENTED
(ERR_PING_NOT_IMPLEMENTED)

GetComponentHealthAlpha1
{
    "component_name": "productpubsub"
}
Response: 2 UNKNOWN
(ERR_HEALTH_NOT_OK)

GetComponentHealthAlpha1
{
    "component_name": "wrongComp"
}
Response: 3 INVALID_ARGUMENT
(ERR_COMPONENT_NOT_FOUND)

GetComponentHealthAlpha1
{
    "component_name": "txnstore"
}
Response: 0 OK
{
    "result": [
        {
            "status": "OK"
        }
    ]
}

DeepanshuA avatar Jan 09 '23 15:01 DeepanshuA

@artursouza Kindly re-review. Thanks.

DeepanshuA avatar Jan 09 '23 16:01 DeepanshuA

Ping @dapr/approvers-dapr @dapr/maintainers-dapr

DeepanshuA avatar Jan 10 '23 08:01 DeepanshuA

OK, this is the same problem I faced in other PRs: requesting a re-review from one person removes the review requests for the others. So I am requesting re-review via this comment. @artursouza @ItalyPaleAle

DeepanshuA avatar Jan 11 '23 18:01 DeepanshuA

I had a discussion with @artursouza and @ItalyPaleAle a few days ago about this PR.

We discussed which actual use cases https://github.com/dapr/dapr/issues/2167 is trying to solve. One of the most needed would be someone running in production who needs to check the health of their components.

For that:

  1. Only an HTTP endpoint would be required.
  2. It would be most useful to send HTTP 200 OK if all components are fine, and otherwise report something like 500 Internal Server Error with a payload listing the failed components.
  3. A similar requirement is likely to arise for actors or other items in the future, so instead of adding a new URL for every item, it would be better to reuse the existing Dapr health endpoint with a query param like include_components=true.
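
To make the three points above concrete, here is a minimal Go sketch of such an endpoint. Everything in it is an assumption for illustration: the handler and function names, the /v1.0/healthz path used in main, the failedComponents payload field, and the reply sent when include_components is absent. It is not the PR's implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sort"
)

// aggregate returns 200 when every component is healthy, otherwise 500
// together with the sorted names of the failed components.
func aggregate(healthy map[string]bool) (int, []string) {
	failed := []string{}
	for name, ok := range healthy {
		if !ok {
			failed = append(failed, name)
		}
	}
	sort.Strings(failed)
	if len(failed) > 0 {
		return http.StatusInternalServerError, failed
	}
	return http.StatusOK, failed
}

// healthzHandler only checks components when include_components=true is set.
func healthzHandler(run func() map[string]bool) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Query().Get("include_components") != "true" {
			// Placeholder for whatever the existing health endpoint already returns.
			w.WriteHeader(http.StatusNoContent)
			return
		}
		code, failed := aggregate(run())
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		_ = json.NewEncoder(w).Encode(map[string][]string{"failedComponents": failed})
	}
}

func main() {
	h := healthzHandler(func() map[string]bool {
		// Hard-coded results standing in for real component pings.
		return map[string]bool{"productpubsub": true, "txnstore": false}
	})
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/v1.0/healthz?include_components=true", nil))
	fmt.Println(rec.Code, rec.Body.String())
}
```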

Now, the issue that arises in satisfying the points above: right now, not all components implement Ping. So:

  a) If we report something like ERR_PING_NOT_IMPLEMENTED, we will almost always end up with a 500 in the response.
  b) If we decide to leave out the components that don't implement Ping, then a user who hasn't dug deep enough to know which components don't implement Ping would get a false positive for those components: they would see that a component is not working as desired, but Dapr would not report it as a failure. This would lead directly to a very bad user experience.

So, if we want to move ahead along the lines we decided, it seems this API makes sense only after all components implement Ping. In essence, I will first gather the use cases and only then decide the correct shape of this API. If the API still seems required and turns out to be distinct from what the original issue asked for, I will probably raise another issue and either modify this PR or raise a new one.

With that in mind, I request @dapr/maintainers-dapr or @dapr/release-team to remove this from 1.10 and NOT schedule it for 1.11 for now either.
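
The trade-off between options a) and b) can be seen in a small Go sketch. The function name overallStatus and its strict flag are hypothetical; the status strings and component names are the ones used in the earlier responses in this thread.

```go
package main

import (
	"fmt"
	"net/http"
)

// overallStatus folds per-component statuses into one HTTP status.
// With strict=true, UNDEFINED counts as a failure (option a);
// with strict=false, UNDEFINED components are ignored (option b).
func overallStatus(statuses map[string]string, strict bool) int {
	for _, s := range statuses {
		switch s {
		case "NOT OK":
			return http.StatusInternalServerError
		case "UNDEFINED":
			if strict {
				return http.StatusInternalServerError
			}
		}
	}
	return http.StatusOK
}

func main() {
	// Two components without Ping, three healthy ones.
	statuses := map[string]string{
		"kafkaComp":     "UNDEFINED",
		"orderpubsub":   "UNDEFINED",
		"productpubsub": "OK",
		"smsbinding":    "OK",
		"txnstore":      "OK",
	}
	fmt.Println("option a:", overallStatus(statuses, true))
	fmt.Println("option b:", overallStatus(statuses, false))
}
```

Under option a) this sample fails even though nothing is known to be broken; under option b) it passes even if kafkaComp or orderpubsub is actually down, which is the false positive described above.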

DeepanshuA avatar Jan 30 '23 07:01 DeepanshuA

This pull request has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

dapr-bot avatar Mar 31 '23 07:03 dapr-bot

👋🤖

ItalyPaleAle avatar Mar 31 '23 19:03 ItalyPaleAle

This pull request has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

dapr-bot avatar May 30 '23 20:05 dapr-bot

This pull request has been automatically closed because it has not had activity in the last 67 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

dapr-bot avatar Jun 06 '23 20:06 dapr-bot

This pull request has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

dapr-bot avatar Aug 21 '23 14:08 dapr-bot

This pull request has been automatically closed because it has not had activity in the last 67 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

dapr-bot avatar Aug 28 '23 14:08 dapr-bot