consul-esm
Support Rotating ACL Tokens
When a consul-esm instance's token is revoked, for example while rotating ACL tokens, consul-esm behaves in several unexpected ways:
- the instance's status remains passing/healthy and is never marked critical. This can be seen at /v1/health/node/:node
- the instance's assigned external health checks are not successfully executed; because the instance stays "passing"/"healthy", they are not reassigned to other, actually healthy instances with appropriate tokens
- the instance is not able to successfully deregister
consul-esm needs its token, which has now been revoked, to update the health check and to deregister, so both operations fail. This is expected as a result of anti-entropy.
The larger issue of supporting rotating ACL tokens is already captured in https://github.com/hashicorp/consul/issues/4372. The recommendation there is to reregister the application (consul-esm in this case) with the new token.
Currently, consul-esm doesn't have a way to reregister itself. On stopping and restarting consul-esm, the stopped instance will fail to deregister while the newly started instance will obtain a new id. This leads to having 'dead', floating consul-esm instances in the cluster. A serious consequence is that these dead consul-esm instances retain responsibility for their external health checks since they remain marked as healthy/passing in the catalog.
This issue arises from the following comment: https://github.com/hashicorp/consul-esm/issues/39#issuecomment-567750936
Steps to reproduce
- Start consul (I used v1.6.2) with ACLs enabled
- Register two external health checks
- Start consul-esm (I used v0.3.3) with the token it needs to operate and log_level=DEBUG
- Start another consul-esm with a different token that it needs to operate and log_level=DEBUG
- Observe that each consul-esm is executing one of the external health checks
- Delete the token for one of the consul-esm instances
- Observe in consul-logs that revoked-token consul-esm has failed its TTL check
- Query /v1/health/node/<revoked-token-consul-esm-id> and see that the status is still passing
- Stop the revoked-token consul-esm instance (Control+C)
- Observe in consul-logs that consul-esm was not able to successfully deregister
- Observe in remaining healthy consul-esm instance that it is executing only one external health check - the one it was originally assigned - and it did not inherit the other external health check
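
The query step above can also be scripted. The following is a minimal Python sketch, not part of the original report. It assumes a Consul agent reachable at http://127.0.0.1:8500 and a still-valid token with read access; both are assumptions. The helper names `fetch_node_checks`, `summarize_statuses`, and `all_passing` are hypothetical. It reads the checks returned by /v1/health/node/:node and reports whether every check is still marked passing.

```python
import json
import urllib.parse
import urllib.request


def fetch_node_checks(node, token, base_url="http://127.0.0.1:8500"):
    """Return the list of checks from GET /v1/health/node/<node>.

    base_url is an assumed local agent address; token must be a token
    that is still valid (not the revoked one).
    """
    url = f"{base_url}/v1/health/node/{urllib.parse.quote(node, safe='')}"
    req = urllib.request.Request(url, headers={"X-Consul-Token": token})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)


def summarize_statuses(checks):
    """Map each CheckID to its reported Status."""
    return {check["CheckID"]: check["Status"] for check in checks}


def all_passing(checks):
    """True only if at least one check exists and every check is passing."""
    return bool(checks) and all(c["Status"] == "passing" for c in checks)
```

For the reproduction above, calling `all_passing(fetch_node_checks(node_id, valid_token))` after deleting the other instance's token would be expected to return `True`, which is the symptom being reported: the instance never turns critical.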