Add components healthcheck endpoint
Description
This PR adds the http://localhost:3500/v1.0/healthz/components/<component_name> and http://localhost:3500/v1.0/healthz/components endpoints for checking the health of any registered component. Equivalent gRPC calls are also available.
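As an illustration only, a minimal Python client for these endpoints could look like the sketch below. It assumes the sidecar listens on localhost:3500 as in the description; the helper names `components_health_url` and `fetch_health` are hypothetical and not part of any Dapr SDK. Non-2xx responses are returned rather than raised, because this API reports component failures through HTTP status codes.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Assumed base URL, taken from the PR description (later revisions use v1.0-alpha1).
DAPR_BASE = "http://localhost:3500/v1.0"


def components_health_url(component_name=None, base=DAPR_BASE):
    """Build the all-components URL, or the per-component URL if a name is given."""
    url = f"{base}/healthz/components"
    if component_name is not None:
        # Percent-encode the name so characters like "/" cannot alter the path.
        url += "/" + urllib.parse.quote(component_name, safe="")
    return url


def fetch_health(url, timeout=5.0):
    """GET a health URL and return (status_code, decoded JSON body or None)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            status, raw = resp.status, resp.read()
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a JSON error body; read it instead of failing.
        status, raw = err.code, err.read()
    return status, (json.loads(raw) if raw else None)
```

Against a running sidecar, `status, body = fetch_health(components_health_url("orderpubsub"))` would return the status code together with the decoded error body, or `None` for an empty 204 response.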
Issue reference
Please reference the issue this PR will close: #2167
Checklist
Please make sure you've completed the relevant tasks for this PR, out of the following list:
- [x] Code compiles correctly
- [x] Created/updated tests
- [x] Unit tests passing
- [x] End-to-end tests passing
- [ ] Extended the documentation / Created issue in the https://github.com/dapr/docs/ repo: dapr/docs#[issue number]
- [ ] Specification has been updated / Created issue in the https://github.com/dapr/docs/ repo: dapr/docs#[issue number]
- [ ] Provided sample for the feature / Created issue in the https://github.com/dapr/docs/ repo: dapr/docs#[issue number]
Thanks for your contribution! Please fix CI and remove WIP when ready.
Codecov Report
Merging #4758 (6c53d1b) into master (ccce9e4) will decrease coverage by 0.06%. The diff coverage is 57.76%.
@@ Coverage Diff @@
## master #4758 +/- ##
==========================================
- Coverage 65.33% 65.28% -0.06%
==========================================
Files 151 151
Lines 15732 15893 +161
==========================================
+ Hits 10279 10376 +97
- Misses 4734 4793 +59
- Partials 719 724 +5
| Impacted Files | Coverage Δ | |
|---|---|---|
| pkg/grpc/endpoints.go | 100.00% <ø> (ø) | |
| utils/utils.go | 59.61% <ø> (ø) | |
| pkg/grpc/api.go | 69.49% <52.70%> (-1.05%) | :arrow_down: |
| pkg/http/api.go | 71.06% <57.14%> (-0.62%) | :arrow_down: |
| pkg/http/responses.go | 92.30% <100.00%> (+2.30%) | :arrow_up: |
| pkg/runtime/runtime.go | 67.36% <100.00%> (+0.07%) | :arrow_up: |
As per the community call, I understand that two updates to the health API are needed:
- Remove the component type from error/errorCode/message.
- Pass more information in HTTP errors, just like in gRPC.
Please confirm whether any other changes are required as well.
I'm sharing the results for the various cases here, so that any other required changes can be addressed at once:
GET http://localhost:3500/v1.0-alpha1/healthz/components
{
"results": [
{
"componentName": "orderpubsub",
"type": "pubsub",
"status": "Undefined",
"error": "ERR_PING_NOT_IMPLEMENTED_BY_pubsub"
},
{
"componentName": "productpubsub",
"type": "pubsub",
"status": "Ok"
},
{
"componentName": "smsbinding",
"type": "bindings",
"status": "Not_Ok",
"error": "ERR_bindings_HEALTH_NOT_OK"
},
{
"componentName": "txnstore",
"type": "state",
"status": "Not_Ok",
"error": "ERR_state_HEALTH_NOT_OK"
}
]
}
GET http://localhost:3500/v1.0-alpha1/healthz/components/orderpubsub 501 Not Implemented
{
"errorCode": "ERR_PING_NOT_IMPLEMENTED_BY_orderpubsub",
"message": "Ping is not imeplemented by orderpubsub"
}
GET http://localhost:3500/v1.0-alpha1/healthz/components/orderpubsub1 400 Bad Request
{
"errorCode": "ERR_COMPONENT_WITH_NAME_orderpubsub1_NOT_FOUND",
"message": "Component With Name orderpubsub1 is not found"
}
GET http://localhost:3500/v1.0-alpha1/healthz/components/productpubsub 204 No Content
GET http://localhost:3500/v1.0-alpha1/healthz/components/txnstore 500 Internal Server Error
{
"errorCode": "ERR_STATE_HEALTH_NOT_OK",
"message": "txnstore is not ok"
}
gRPC:
CheckAllComponentsHealthAlpha1 Input message: {} Response:
{
"results": [
{
"componentName": "orderpubsub",
"type": "pubsub",
"status": "undefined",
"error": "rpc error: code = Unimplemented desc = Ping is not imeplemented by orderpubsub"
},
{
"componentName": "productpubsub",
"type": "pubsub",
"status": "ok"
},
{
"componentName": "smsbinding",
"type": "bindings",
"status": "not_ok",
"error": "redis binding: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
},
{
"componentName": "txnstore",
"type": "state",
"status": "not_ok",
"error": "redis store: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
}
]
}
CheckHealthAlpha1 Input message: { "component_name": "orderpubsub" } Response: 12 UNIMPLEMENTED
CheckHealthAlpha1 Input message: { "component_name": "orderpubsub1" } Response: 3 INVALID_ARGUMENT
CheckHealthAlpha1 Input message: { "component_name": "productpubsub" } Response: 0 OK
CheckHealthAlpha1 Input message: { "component_name": "txnstore" } Response: 2 UNKNOWN
@yaron2 @artursouza
"error": "ERR_PING_NOT_IMPLEMENTED_BY_pubsub"
LGTM.
For error codes and statuses, use ALL CAPS and keep error codes short, like ERR_COMPONENT_NOT_FOUND.
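The naming rule requested above can be checked mechanically. The sketch below assumes an error code must be `ERR` followed by upper-case words separated by underscores, and uses a hypothetical 32-character limit as a stand-in for "short"; neither the pattern nor the limit comes from Dapr.

```python
import re

# "ERR" followed by one or more "_WORD" segments made of capitals and digits.
ERROR_CODE_RE = re.compile(r"ERR(?:_[A-Z0-9]+)+")
MAX_ERROR_CODE_LEN = 32  # hypothetical limit standing in for "keep error codes short"


def is_valid_error_code(code):
    """Return True if code is ALL CAPS, ERR_-prefixed, and within the length limit."""
    return bool(ERROR_CODE_RE.fullmatch(code)) and len(code) <= MAX_ERROR_CODE_LEN
```

With this check, the first-round code ERR_PING_NOT_IMPLEMENTED_BY_pubsub fails because of its lower-case suffix, while ERR_COMPONENT_NOT_FOUND passes.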
Please review. As discussed, I have changed the responses; the current ones are as follows:
============================================
For HTTP:
Query all components:
When a few components don't implement Ping and the others are healthy:
GET http://localhost:3500/v1.0-alpha1/healthz/components
200 OK
{
"results": [
{
"componentName": "kafkaComp",
"type": "pubsub",
"status": "UNDEFINED",
"errorCode": "ERR_PING_NOT_IMPLEMENTED"
},
{
"componentName": "orderpubsub",
"type": "pubsub",
"status": "UNDEFINED",
"errorCode": "ERR_PING_NOT_IMPLEMENTED"
},
{
"componentName": "productpubsub",
"type": "pubsub",
"status": "OK"
},
{
"componentName": "smsbinding",
"type": "bindings",
"status": "OK"
},
{
"componentName": "txnstore",
"type": "state",
"status": "OK"
}
]
}
Query all components:
When a few components don't implement Ping and the others are NOT healthy:
GET http://localhost:3500/v1.0-alpha1/healthz/components
{
"results": [
{
"componentName": "kafkaComp",
"type": "pubsub",
"status": "UNDEFINED",
"errorCode": "ERR_PING_NOT_IMPLEMENTED"
},
{
"componentName": "orderpubsub",
"type": "pubsub",
"status": "UNDEFINED",
"errorCode": "ERR_PING_NOT_IMPLEMENTED"
},
{
"componentName": "productpubsub",
"type": "pubsub",
"status": "NOT OK",
"errorCode": "ERR_HEALTH_NOT_OK",
"message": "redis pubsub: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
},
{
"componentName": "smsbinding",
"type": "bindings",
"status": "NOT OK",
"errorCode": "ERR_HEALTH_NOT_OK",
"message": "redis binding: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
},
{
"componentName": "txnstore",
"type": "state",
"status": "NOT OK",
"errorCode": "ERR_HEALTH_NOT_OK",
"message": "redis store: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
}
]
}
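A client consuming the all-components response mostly needs to know which components are failing and which could not be checked. A minimal sketch, assuming the payload shape shown above (and also accepting the `result` key used in a later revision); `group_by_status` is a hypothetical helper:

```python
import json
from collections import defaultdict


def group_by_status(payload):
    """Map each reported status (OK, NOT OK, UNDEFINED, ...) to component names."""
    data = json.loads(payload) if isinstance(payload, (str, bytes)) else payload
    # The earlier revision uses "results"; a later one renames it to "result".
    entries = data.get("results", data.get("result", []))
    grouped = defaultdict(list)
    for entry in entries:
        grouped[entry.get("status", "UNKNOWN")].append(entry["componentName"])
    return dict(grouped)
```

Applied to the payload above, the "NOT OK" group would contain productpubsub, smsbinding and txnstore, and the "UNDEFINED" group kafkaComp and orderpubsub.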
Query a single component:
GET http://localhost:3500/v1.0-alpha1/healthz/components/orderpubsub 405 Method Not Allowed
{
"status": "UNDEFINED",
"errorCode": "ERR_PING_NOT_IMPLEMENTED"
}
GET http://localhost:3500/v1.0-alpha1/healthz/components/productpubsub 500 Internal Server Error
{
"status": "NOT OK",
"errorCode": "ERR_HEALTH_NOT_OK",
"message": "redis pubsub: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
}
GET http://localhost:3500/v1.0-alpha1/healthz/components/wrongComp 400 Bad Request
{
"status": "UNDEFINED",
"errorCode": "ERR_COMPONENT_NOT_FOUND"
}
GET http://localhost:3500/v1.0-alpha1/healthz/components/txnstore 200 OK
{
"status": "OK"
}
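The single-component HTTP responses above encode the outcome in the status code, with the errorCode field giving detail. A client-side interpretation could be sketched as follows; the mapping only mirrors the responses listed above for this alpha API and may not match later revisions.

```python
from http import HTTPStatus

# Mirrors the per-component responses shown above (alpha API, subject to change).
HTTP_OUTCOMES = {
    HTTPStatus.OK: "healthy",
    HTTPStatus.METHOD_NOT_ALLOWED: "ping not implemented",
    HTTPStatus.BAD_REQUEST: "component not found",
    HTTPStatus.INTERNAL_SERVER_ERROR: "unhealthy",
}


def interpret(status_code, body=None):
    """Return (outcome, errorCode) for a per-component health response."""
    outcome = HTTP_OUTCOMES.get(status_code, "unexpected response")
    return outcome, (body or {}).get("errorCode")
```

Because `HTTPStatus` is an `IntEnum`, plain integer status codes such as 405 match the enum keys directly.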
gRPC:
Query all components:
When a few components don't implement Ping and the others are healthy:
GetAllComponentsHealthAlpha1 0:OK
{
"results": [
{
"componentName": "kafkaComp",
"type": "pubsub",
"status": "UNDEFINED",
"errorCode": "ERR_PING_NOT_IMPLEMENTED"
},
{
"componentName": "orderpubsub",
"type": "pubsub",
"status": "UNDEFINED",
"errorCode": "ERR_PING_NOT_IMPLEMENTED"
},
{
"componentName": "productpubsub",
"type": "pubsub",
"status": "OK"
},
{
"componentName": "smsbinding",
"type": "bindings",
"status": "OK"
},
{
"componentName": "txnstore",
"type": "state",
"status": "OK"
}
]
}
Query all components:
When a few components don't implement Ping and the others are NOT healthy:
GetAllComponentsHealthAlpha1 0:OK
{
"results": [
{
"componentName": "kafkaComp",
"type": "pubsub",
"status": "UNDEFINED",
"errorCode": "ERR_PING_NOT_IMPLEMENTED"
},
{
"componentName": "orderpubsub",
"type": "pubsub",
"status": "UNDEFINED",
"errorCode": "ERR_PING_NOT_IMPLEMENTED"
},
{
"componentName": "productpubsub",
"type": "pubsub",
"status": "NOT OK",
"errorCode": "ERR_HEALTH_NOT_OK",
"message": "redis pubsub: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
},
{
"componentName": "smsbinding",
"type": "bindings",
"status": "NOT OK",
"errorCode": "ERR_HEALTH_NOT_OK",
"message": "redis binding: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
},
{
"componentName": "txnstore",
"type": "state",
"status": "NOT OK",
"errorCode": "ERR_HEALTH_NOT_OK",
"message": "redis store: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
}
]
}
Query a single component:
GetComponentHealthAlpha1
{
"component_name": "orderpubsub"
}
Response: 12 UNIMPLEMENTED
(ERR_PING_NOT_IMPLEMENTED)
GetComponentHealthAlpha1
{
"component_name": "productpubsub"
}
Response: 2 UNKNOWN
(ERR_HEALTH_NOT_OK)
GetComponentHealthAlpha1
{
"component_name": "wrongComp"
}
Response: 3 INVALID_ARGUMENT
(ERR_COMPONENT_NOT_FOUND)
GetComponentHealthAlpha1
{
"component_name": "txnstore"
}
Response: 0 OK
{
"status": "OK"
}
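On the gRPC side, the same outcomes arrive as status codes. The numeric values below are the standard gRPC codes; their pairing with error codes only restates the responses above. A sketch, with `error_code_for` as a hypothetical helper:

```python
# Standard gRPC status code numbers used by the responses above.
GRPC_OK = 0
GRPC_UNKNOWN = 2
GRPC_INVALID_ARGUMENT = 3
GRPC_UNIMPLEMENTED = 12

# Pairing observed for GetComponentHealthAlpha1 in this revision.
GRPC_TO_ERROR_CODE = {
    GRPC_OK: None,
    GRPC_UNIMPLEMENTED: "ERR_PING_NOT_IMPLEMENTED",
    GRPC_UNKNOWN: "ERR_HEALTH_NOT_OK",
    GRPC_INVALID_ARGUMENT: "ERR_COMPONENT_NOT_FOUND",
}


def error_code_for(grpc_code):
    """Translate a gRPC status code from the health RPC into an error code."""
    if grpc_code not in GRPC_TO_ERROR_CODE:
        raise ValueError(f"unexpected gRPC status code: {grpc_code}")
    return GRPC_TO_ERROR_CODE[grpc_code]
```

One caveat on this design: UNKNOWN (2) is a catch-all code that a server can also produce for unrelated failures (for example, a plain Go error returned without a gRPC status), so reading it as ERR_HEALTH_NOT_OK is only reasonable for this specific RPC.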
This pull request has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
/keep-alive
@yaron2 @mukundansundar Please re-review.
Please review. As per the comment above, I have changed the requests; the current ones are as follows:
============================================
For HTTP:
Query all components:
When a few components don't implement Ping and the others are healthy:
GET http://localhost:3500/v1.0-alpha1/healthz/component
200 OK
{
"result": [
{
"componentName": "kafkaComp",
"type": "pubsub",
"status": "UNDEFINED",
"errorCode": "ERR_PING_NOT_IMPLEMENTED"
},
{
"componentName": "orderpubsub",
"type": "pubsub",
"status": "UNDEFINED",
"errorCode": "ERR_PING_NOT_IMPLEMENTED"
},
{
"componentName": "productpubsub",
"type": "pubsub",
"status": "OK"
},
{
"componentName": "smsbinding",
"type": "bindings",
"status": "OK"
},
{
"componentName": "txnstore",
"type": "state",
"status": "OK"
}
]
}
Query all components:
When a few components don't implement Ping and the others are NOT healthy:
GET http://localhost:3500/v1.0-alpha1/healthz/component
{
"result": [
{
"componentName": "kafkaComp",
"type": "pubsub",
"status": "UNDEFINED",
"errorCode": "ERR_PING_NOT_IMPLEMENTED"
},
{
"componentName": "orderpubsub",
"type": "pubsub",
"status": "UNDEFINED",
"errorCode": "ERR_PING_NOT_IMPLEMENTED"
},
{
"componentName": "productpubsub",
"type": "pubsub",
"status": "NOT OK",
"errorCode": "ERR_HEALTH_NOT_OK",
"message": "redis pubsub: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
},
{
"componentName": "smsbinding",
"type": "bindings",
"status": "NOT OK",
"errorCode": "ERR_HEALTH_NOT_OK",
"message": "redis binding: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
},
{
"componentName": "txnstore",
"type": "state",
"status": "NOT OK",
"errorCode": "ERR_HEALTH_NOT_OK",
"message": "redis store: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
}
]
}
Query a single component:
GET http://localhost:3500/v1.0-alpha1/healthz/component?componentName=orderpubsub 405 Method Not Allowed
{
"status": "UNDEFINED",
"errorCode": "ERR_PING_NOT_IMPLEMENTED"
}
GET http://localhost:3500/v1.0-alpha1/healthz/component?componentName=productpubsub 500 Internal Server Error
{
"status": "NOT OK",
"errorCode": "ERR_HEALTH_NOT_OK",
"message": "redis pubsub: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
}
GET http://localhost:3500/v1.0-alpha1/healthz/component?componentName=wrongComp 400 Bad Request
{
"status": "UNDEFINED",
"errorCode": "ERR_COMPONENT_NOT_FOUND"
}
GET http://localhost:3500/v1.0-alpha1/healthz/component?componentName=txnstore 200 OK
{
"status": "OK"
}
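In this revision the component name moves from the URL path into a componentName query parameter. Building such URLs with proper encoding could look like this; the helper name is hypothetical, and the base URL is the one used in the examples above.

```python
from urllib.parse import urlencode

# Endpoint from the examples above (alpha API).
HEALTH_URL = "http://localhost:3500/v1.0-alpha1/healthz/component"


def component_health_url(component_name=None, base=HEALTH_URL):
    """Return the all-components URL, or append ?componentName=... for one component."""
    if component_name is None:
        return base
    # urlencode percent-encodes reserved characters such as "&" and turns spaces into "+".
    return f"{base}?{urlencode({'componentName': component_name})}"
```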
gRPC:
Query all components:
When a few components don't implement Ping and the others are healthy:
GetComponentHealthAlpha1 0:OK
{
"result": [
{
"component_name": "kafkaComp",
"type": "pubsub",
"status": "UNDEFINED",
"error_code": "ERR_PING_NOT_IMPLEMENTED"
},
{
"component_name": "orderpubsub",
"type": "pubsub",
"status": "UNDEFINED",
"error_code": "ERR_PING_NOT_IMPLEMENTED"
},
{
"component_name": "productpubsub",
"type": "pubsub",
"status": "OK"
},
{
"component_name": "smsbinding",
"type": "bindings",
"status": "OK"
},
{
"component_name": "txnstore",
"type": "state",
"status": "OK"
}
]
}
Query all components:
When a few components don't implement Ping and the others are NOT healthy:
GetComponentHealthAlpha1 0:OK
{
"result": [
{
"component_name": "kafkaComp",
"type": "pubsub",
"status": "UNDEFINED",
"error_code": "ERR_PING_NOT_IMPLEMENTED"
},
{
"component_name": "orderpubsub",
"type": "pubsub",
"status": "UNDEFINED",
"error_code": "ERR_PING_NOT_IMPLEMENTED"
},
{
"component_name": "productpubsub",
"type": "pubsub",
"status": "NOT OK",
"error_code": "ERR_HEALTH_NOT_OK",
"message": "redis pubsub: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
},
{
"component_name": "smsbinding",
"type": "bindings",
"status": "NOT OK",
"error_code": "ERR_HEALTH_NOT_OK",
"message": "redis binding: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
},
{
"component_name": "txnstore",
"type": "state",
"status": "NOT OK",
"error_code": "ERR_HEALTH_NOT_OK",
"message": "redis store: error connecting to redis at localhost:6379: dial tcp 127.0.0.1:6379: connect: connection refused"
}
]
}
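The gRPC output in this revision uses snake_case keys (component_name, error_code), while the HTTP output uses camelCase (componentName, errorCode). A client handling both could normalize keys before further processing; a minimal sketch with hypothetical helper names:

```python
def snake_to_camel(key):
    """Convert a snake_case key such as "error_code" to "errorCode"."""
    head, *rest = key.split("_")
    return head + "".join(part.capitalize() for part in rest)


def normalize_entry(entry):
    """Return a copy of one result entry with camelCase keys."""
    return {snake_to_camel(k): v for k, v in entry.items()}
```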
Query a single component:
GetComponentHealthAlpha1
{
"component_name": "orderpubsub"
}
Response: 12 UNIMPLEMENTED
(ERR_PING_NOT_IMPLEMENTED)
GetComponentHealthAlpha1
{
"component_name": "productpubsub"
}
Response: 2 UNKNOWN
(ERR_HEALTH_NOT_OK)
GetComponentHealthAlpha1
{
"component_name": "wrongComp"
}
Response: 3 INVALID_ARGUMENT
(ERR_COMPONENT_NOT_FOUND)
GetComponentHealthAlpha1
{
"component_name": "txnstore"
}
Response: 0 OK
{
"result": [
{
"status": "OK"
}
]
}
@artursouza Kindly re-review. Thanks.
Ping @dapr/approvers-dapr @dapr/maintainers-dapr
OK, this is the same problem I faced in other PRs: requesting a re-review from one person removes the review requests from the others. So I'm requesting re-review via comment instead. @artursouza @ItalyPaleAle
A few days ago, I had a discussion with @artursouza and @ItalyPaleAle about this PR.
We discussed what actual use cases https://github.com/dapr/dapr/issues/2167 is trying to solve. One of the most important would be someone running in production who needs to check the health of their components.
For that:
- Only an HTTP endpoint would be required.
- It would be most useful to return HTTP 200 OK if all components are healthy, and otherwise something like 500 Internal Server Error with a payload listing the failed components.
- A similar requirement will likely arise in the future for actors and other items, so instead of adding a new URL for each item, it would be better to reuse the existing Dapr health endpoint with a query parameter like include_components=true.
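The proposal above can be sketched as a single aggregation step: 200 when every component reports OK, otherwise 500 with the failing components in the payload. The payload field name `failedComponents` is hypothetical, and this sketch deliberately treats every non-OK status, including UNDEFINED, as a failure.

```python
from http import HTTPStatus


def aggregate_health(results):
    """Return (status_code, payload) for an include_components=true style check."""
    failed = [r for r in results if r.get("status") != "OK"]
    if not failed:
        return HTTPStatus.OK, {"status": "OK"}
    return HTTPStatus.INTERNAL_SERVER_ERROR, {
        "status": "NOT OK",
        # Hypothetical field listing each failing component and its error code.
        "failedComponents": [
            {"componentName": r["componentName"], "errorCode": r.get("errorCode")}
            for r in failed
        ],
    }
```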
The problem with satisfying the points above is that, right now, not all components implement Ping. So:
a) If we report something like ERR_PING_NOT_IMPLEMENTED as a failure, the response will almost always be 500.
b) If we exclude the components that don't implement Ping altogether, a user who hasn't dug deep enough to know which components don't implement Ping would get a false positive for them: they could see that a component is not working as desired, yet Dapr would not report it as a failure. This would directly lead to a very bad user experience.
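The trade-off between options a) and b) can be made concrete with a small sketch. `overall_status` is a hypothetical helper; its `undefined_is_failure` flag selects option a) (True) or option b) (False).

```python
def overall_status(results, undefined_is_failure):
    """Return 200 or 500 for a list of component results.

    undefined_is_failure=True is option a): components without Ping fail the check.
    undefined_is_failure=False is option b): they are silently skipped.
    """
    for r in results:
        status = r.get("status")
        if status == "NOT OK" or (status == "UNDEFINED" and undefined_is_failure):
            return 500
    return 200
```

For a result set where one component lacks Ping and the rest are OK, option a) returns 500 whether or not that component actually works, and option b) returns 200 even if it is broken, which is the false positive described above.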
So, if we want to move ahead along the lines we decided, this API seems to make sense only after all components implement Ping. In essence, I will first try to gather the use cases, and only then decide on the correct nature of this API. Accordingly, if this API still seems necessary and turns out to be distinct from the original issue, I will probably raise another issue and either modify the existing PR or raise a new one.
With this in mind, I request @dapr/maintainers-dapr or @dapr/release-team to remove it from 1.10, and NOT schedule it for 1.11 for now either.
This pull request has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!
👋🤖
This pull request has been automatically closed because it has not had activity in the last 67 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!