dskit
dskit copied to clipboard
Crash immediately when critical services cannot start.
We should be able to mark services as critical. If such as service fails to load the complete startup process should fail immediately rather than reporting a dependency failing. Ideally the exit indicates which service failed.
How is this different than the current behavior?
Currently, it's not immediately apparent where the error came from. Take Loki's service error handling as an example. It logs the error and propagates it. This makes debugging harder. I wish the critical service would panic immediately so it's easier to see what's wrong instead of tracing the service logs.
Could you share an example of a log message you're currently seeing when a service fail and how you would like to improve it? I'm asking cause I think what you're looking for is more related to improve the error handling in Loki than changing the service framework.
Sure, this is an example
level=info ts=2021-09-30T14:38:59.028627832Z caller=module_service.go:64 msg=initialising module=runtime-config
level=error ts=2021-09-30T14:38:59.028908303Z caller=loki.go:292 msg="module failed" module=runtime-config error="invalid service state: Failed, expected: Running, failure: failed to load runtime config: load file: yaml: unmarshal errors:\n line 9: field max_stream_burst_rate_bytes not found in type validation.plain\n line 10: field max_stream_rate_bytes not found in type validation.plain"
level=debug ts=2021-09-30T14:38:59.02894646Z caller=module_service.go:86 msg=stopping module=license
level=info ts=2021-09-30T14:38:59.028960307Z caller=module_service.go:96 msg="module stopped" module=license
level=error ts=2021-09-30T14:38:59.028981814Z caller=loki.go:292 msg="module failed" module=ring error="failed to start ring, because it depends on module runtime-config, which has failed: invalid service state: Failed, expected: Running, failure: invalid service state: Failed, expected: Running, failure: failed to load runtime config: load file: yaml: unmarshal errors:\n line 9: field max_stream_burst_rate_bytes not found in type validation.plain\n line 10: field max_stream_rate_bytes not found in type validation.plain"
level=debug ts=2021-09-30T14:38:59.028998621Z caller=module_service.go:109 msg="module waiting for" module=server waiting_for=ring
level=debug ts=2021-09-30T14:38:59.029009668Z caller=module_service.go:109 msg="module waiting for" module=server waiting_for=distributor
level=error ts=2021-09-30T14:38:59.029027374Z caller=loki.go:292 msg="module failed" module=distributor error="failed to start distributor, because it depends on module ring, which has failed: context canceled"
level=debug ts=2021-09-30T14:38:59.029044629Z caller=module_service.go:109 msg="module waiting for" module=memberlist-kv waiting_for=ring
level=debug ts=2021-09-30T14:38:59.029055213Z caller=module_service.go:86 msg=stopping module=memberlist-kv
level=info ts=2021-09-30T14:38:59.029069775Z caller=module_service.go:96 msg="module stopped" module=memberlist-kv
level=debug ts=2021-09-30T14:38:59.029089185Z caller=module_service.go:86 msg=stopping module=server
level=info ts=2021-09-30T14:38:59.0303212Z caller=server_service.go:50 msg="server stopped"
level=info ts=2021-09-30T14:38:59.03037297Z caller=module_service.go:96 msg="module stopped" module=server
level=info ts=2021-09-30T14:38:59.030388749Z caller=loki.go:281 msg="Loki stopped"
level=error ts=2021-09-30T14:38:59.030468496Z caller=log.go:106 msg="error running Grafana Enterprise Logs" err="failed services\ngithub.com/grafana/loki/pkg/loki.(*Loki).Run\n\t/src/enterprise-logs/vendor/github.com/grafana/loki/pkg/loki/loki.go:327\ngithub.com/grafana/loki-enterprise/pkg/enterprise/loki/init.(*LokiEnterprise).Run\n\t/src/enterprise-logs/pkg/enterprise/loki/init/loki.go:232\nmain.main\n\t/src/enterprise-logs/cmd/enterprise-logs/main.go:145\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:225\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371"
The actual error is a few lines up
level=error ts=2021-09-30T14:38:59.028981814Z caller=loki.go:292 msg="module failed" module=ring error="failed to start ring, because it depends on module runtime-config, which has failed: invalid service state: Failed, expected: Running, failure: invalid service state: Failed, expected: Running, failure: failed to load runtime config: load file: yaml: unmarshal errors:\n line 9: field max_stream_burst_rate_bytes not found in type validation.plain\n line 10: field max_stream_rate_bytes not found in type validation.plain"
Ideally the last message in the log would be something like|
Critical module runtime-config failed to start: config schema does not match expected schema. Invalid field `max_stream_rate_bytes`.
exit code 42
I agree that the error message itself could be improved. However, the important bit here is that failed to start ring, because it depends on module runtime-config, which has failed is already too much. Why would I see this error at all if runtime-config already failed? Nothing should even try to start if a critical module such as runtime-config failed.
exit code 42
If you want to abruptly exit with no graceful shutdown (even on error, which is what we're doing right now) then you would need to do this change to Loki. I will argue whether this could be a good idea or not, but in any case looks like something to address in the application and not dskit.
Hm. So you mean via a service failed listener such as this one?
This can already be implemented by getting map of failed services from ServiceManager and reporting their errors (see ServicesByState and check for Failed state).