RFC: Health Report API
As: a person responsible for ensuring Logstash pipelines are running without issue
I want: a health report similar to the one provided by Elasticsearch
So that: I can easily identify issues with the health of a Logstash process and its pipelines
Phase 1: API & Initial Indicators
This is a plan to deliver a `GET /_health_report` endpoint that is heavily inspired by the shape of Elasticsearch's endpoint of the same name, adhering to its prior art wherever possible.

In Elasticsearch, the health report endpoint presents a flat collection of `indicators` and a `status` that reflects the status of the least-healthy indicator in the collection. Each indicator has its own `status` and optional `symptom`, and may optionally provide information about the `impacts` of a less-than-healthy status, one or more `diagnosis` entries with information about the cause, actions-to-remediate, and links to help, and/or `details` relevant to that particular indicator.
Because many aspects of Logstash processing and the actionable insights required by operational administrators are pipeline-centric, we will introduce a top-level indicator called `pipelines` that will have a mapping of sub-indicators, one for each pipeline. Like the top-level `#/status`, the value of `#/indicators/pipelines/status` will bubble up the least-healthy status of its component indicators.
The Logstash agent will maintain a pipeline-indicator for each pipeline that the agent knows about, and each pipeline-indicator will have one or more probes that are capable of marking the indicator as unhealthy with a `diagnosis` and optional `impacts`.
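To make the bubble-up behavior concrete, here is a minimal sketch (hypothetical helper names, not part of this proposal's API) of deriving an indicator's status as the least-healthy status of its components, using one reasonable ranking consistent with the statuses in the schema below:

```ruby
# Minimal sketch: an indicator is only as healthy as its least-healthy
# component. The ranking below is an assumption, ordered from healthiest
# to least healthy.
STATUS_RANK = { "green" => 0, "unknown" => 1, "yellow" => 2, "red" => 3 }.freeze

# Reduce the statuses of component indicators (or probes) to a single status.
# An empty collection yields "unknown" rather than assuming health.
def bubble_up(component_statuses)
  component_statuses.max_by { |s| STATUS_RANK.fetch(s, STATUS_RANK["unknown"]) } || "unknown"
end

bubble_up(%w[green green yellow]) # => "yellow"
bubble_up(%w[green red yellow])   # => "red"
```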
Proposed Schema:
Click to expand Schema
```yaml
---
$schema: https://json-schema.org/draft/2020-12/schema
$id: health_report
type: object
title: Logstash Health Report
description: >-
  Information about Logstash, its health status, and the indicators
  that were used to determine the current status.
properties:
  host:
    type: string
  version:
    type: string
  snapshot:
    type: boolean
  name:
    type: string
  ephemeral_id:
    type: string
allOf:
  - $ref: 'schema:has_status'
    required: ['status']
  - $ref: 'schema:has_indicators'
    required: ['indicators']
unevaluatedProperties: false
$defs:
  indicator:
    $id: schema:indicator
    description: >-
      An indicator has a `status` and a `symptom`, and may provide `details`,
      `impacts`, and/or `diagnosis`, or may be composed of internal `indicators`
    allOf:
      - $ref: 'schema:has_status'
        required: ['status']
      - $ref: 'schema:has_symptom'
      - $ref: 'schema:has_details'
      - $ref: 'schema:has_impacts'
      - $ref: 'schema:has_diagnosis'
      - $ref: 'schema:has_indicators' # NOTE: indicator/has_indicators is the only addition to the
    unevaluatedProperties: false
  has_status:
    $id: schema:has_status
    type: object
    properties:
      status:
        type: string
        description: >-
          Health status
        oneOf:
          - const: green
            description: everything is okay
          - const: unknown
            description: status could not be determined
          - const: yellow
            description: functionality is in a degraded state and may need remediation
          - const: red
            description: functionality is in a critical state and remediation is urgent
  has_symptom:
    $id: schema:has_symptom
    type: object
    properties:
      symptom:
        type: string
        description: >-
          A brief message providing information about the current health status
        maxLength: 1024
  has_indicators:
    $id: schema:has_indicators
    type: object
    properties:
      indicators:
        type: object
        description: >-
          A key/value map of one or more named component indicators
          that were used to determine this indicator's health status
        minProperties: 1
        additionalProperties:
          $ref: 'schema:indicator'
  has_diagnosis:
    $id: schema:has_diagnosis
    type: object
    properties:
      diagnosis:
        type: array
        items:
          $ref: 'schema:diagnosis'
  has_impacts:
    $id: 'schema:has_impacts'
    type: object
    properties:
      impacts:
        type: array
        items:
          $ref: 'schema:impact'
  has_details:
    $id: 'schema:has_details'
    type: object
    properties:
      details:
        type: object
  diagnosis:
    $id: 'schema:diagnosis'
    type: object
    properties:
      cause:
        type: string
        description: >-
          A brief description of a root cause of this health problem
      action:
        type: string
        description: >-
          A brief description of the steps that should be taken to remediate the
          problem. A more detailed step-by-step guide to remediate the problem is
          provided by the `help_url` field.
      affected_resources:
        # Supported shape and values TBD; this matches the corresponding entry
        # in Elasticsearch, but its shape is ambiguous as docs claim it to be
        # a list of strings while it has been observed to be a map of resource
        # type to list of strings.
        # see: https://github.com/elastic/elasticsearch/issues/106925
        description: >-
          If the root cause pertains to multiple resources in Logstash,
          this will hold all resources that this diagnosis is applicable for.
      help_url:
        type: string
        description: >-
          A link to the troubleshooting guide that’ll fix the health problem.
    required:
      - cause
      - action
      - help_url
    unevaluatedProperties: false
  impact:
    $id: 'schema:impact'
    type: object
    properties:
      severity:
        type: integer
        description: >-
          How important this impact is to functionality. A value of 1 is the highest
          severity, with larger values indicating lower severity.
      description:
        type: string
        description: >-
          A description of the impact on the subject of the indicator.
      impact_areas:
        type: array
        description: >-
          The areas of functionality that this impact affects.
        items:
          # Supported enum values TBD; this matches the corresponding shape in
          # Elasticsearch, but we have yet to determine a semantic match.
          type: string
          oneOf:
            - const: unknown
              description: the area of impact is unknown.
    required:
      - severity
      - description
    unevaluatedProperties: false
```
Click to expand Example
{ "status": "yellow", "host": "logstash-742.internal", "version": "8.14.1", "snapshot": false, "ephemeral_id": "0f4ac35f-5d5a-4067-9533-7893197cf5f9", "indicators": { "resources": { "status": "yellow", "diagnosis": [ { "cause": "JVM garbage collection is spending significant time", "action": "Tune memory to reflect your workload", "help_url": "https://ela.st/logstash-memory-pressure" } ] }, "pipelines": { "status": "yellow", "indicators": { "pipeline-one": { "status": "green", "symptom": "everything is okay" }, "pipeline-two": { "status": "yellow", "symptom": "multiple probes report degraded processing", "diagnosis": [ { "cause": "persistent queue shows net growth over the last 5 minutes", "action": "look downstream for backpressure", "help_url": "https://ela.st/logstash-pq-growth" },{ "cause": "workers fully utilized", "action": "increase worker capacity", "help_url": "https://ela.st/logstash-worker-allocation" } ], "impacts": [ { "severity": 10, "description": "Growth of the Persisted Queue means increased lag" },{ "severity": 10, "description": "When workers are fully utilized their throughput is limited" } ] } } } } }
Internally:
- an indicator either has at least one probe XOR has at least one inner indicator, and is only as healthy as its least-healthy component.
- probes themselves aren't exposed via the API; rather, they are the internal component that can add a diagnosis and impacts to the indicator and degrade its status.
- the top-level `status` of any API response that includes it reflects the same value as one running the `GET /_health_report`, including `GET /_node` and `GET /_node_stats`.
- the health report itself will initially be on-demand when the API request is made, but may later be made to run on a schedule or cached.
In the first stage we will introduce the `GET /_health_report` endpoint itself with the following indicators and probes:

- `#/indicators/resources`
  - probe: `resources:memory_pressure` (a rough sketch of these thresholds follows this list):
    - degraded when GC pause > 4% of wall-clock
    - critical when GC pause > 8% of wall-clock
    - first-pass: lifetime cumulative metric
    - stretch goal: flow metric using `last_1_minute` window
  - probe:
- `#/indicators/pipelines/indicators/<pipeline-id>`:
  - probe: `pipelines:up`:
    - critical when stopped
    - degraded when starting or stopping
    - okay when running
    - stretch goal: track transition states, including restarts, so that we can keep green through a quick restart.
  - probe:
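As a rough sketch of the `resources:memory_pressure` probe described above (hypothetical method and argument names; the first pass uses lifetime cumulative metrics):

```ruby
# Sketch of the phase-1 memory-pressure thresholds: degraded above 4% of
# wall-clock spent in GC pauses, critical above 8%, computed from lifetime
# cumulative metrics (the stretch goal would swap in a last_1_minute window).
def memory_pressure_status(gc_pause_millis:, uptime_millis:)
  return "unknown" if uptime_millis.zero?

  pause_pct = 100.0 * gc_pause_millis / uptime_millis
  if    pause_pct > 8 then "red"     # critical: remediation is urgent
  elsif pause_pct > 4 then "yellow"  # degraded: may need remediation
  else                     "green"
  end
end

memory_pressure_status(gc_pause_millis: 6_000, uptime_millis: 100_000) # => "yellow"
```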
Phase 2: Additional pipeline probes
In subsequent stages we will introduce additional probes to the pipeline indicators to allow them to diagnose the pipeline's behavior from its flow state. Each probe will feed off of the flow metrics for the pipeline, and will present pipeline-specific settings in the `pipeline.health.probe.<probe-name>` namespace for configuration.

For example, a probe `queue_persisted_growth_events` that inspects the `queue_persisted_growth_events` flow metric would have default settings like:
```yaml
pipeline.health.probe.queue_persisted_growth_events:
  enabled: auto   # true if queue.type == persisted
  degraded: last_5_minutes > 0
  critical: last_15_minutes > 0
```
Or a `worker_utilization` probe that inspects the `worker_utilization` flow metric to report issues if the workers are fully utilized:
```yaml
pipeline.health.probe.worker_utilization:
  enabled: true
  degraded: last_1_minute > 99
  critical: last_15_minutes > 99
```
Implementation note: the `dentaku` calculator library for Ruby would allow us to parse a math expression with variables into an AST, to query that AST for the named variables it needs, and to efficiently evaluate that AST with a set of provided variables. This allows us to (1) support an explicit subset of possible ASTs (notably `Arithmetic`, `Comparator`, `Grouping`, `Numeric`, and `Identifier`, possibly `Function(type=logical)`), (2) reject configurations that reference variables we know will never exist, and (3) avoid evaluating a `trigger` or `recover` expression when the flow metric window it needs is not yet available. It is a prime candidate for accelerating development, but care should be taken to avoid using its auto-AST-caching (which has no cache invalidation), and to limit expressions to the minimum-necessary allowlist of node types to ensure portability (likely with a validation visitor).
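As an illustration of points (2) and (3), here is a sketch of validating and evaluating a configured probe expression with dentaku; the flow-window names and the reject/skip behaviour are assumptions for illustration, and only `Dentaku::Calculator#dependencies` and `#evaluate` are used:

```ruby
require "dentaku"

# Flow windows the probe expressions are allowed to reference (illustrative list).
KNOWN_FLOW_WINDOWS = %w[current last_1_minute last_5_minutes last_15_minutes lifetime].freeze

calculator = Dentaku::Calculator.new
expression = "last_5_minutes > 0" # e.g. the `degraded` expression from above

# (2) reject configurations that reference variables we know will never exist
unknown = calculator.dependencies(expression) - KNOWN_FLOW_WINDOWS
raise ArgumentError, "unknown flow window(s): #{unknown.join(', ')}" unless unknown.empty?

# (3) skip evaluation while a needed flow window is not yet available,
#     e.g. a pipeline that has only been running for a couple of minutes
available_windows = { "last_5_minutes" => 0.4 }
if (calculator.dependencies(expression) - available_windows.keys).empty?
  calculator.evaluate(expression, available_windows) # => true, so the probe trips "degraded"
end
```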
Split options:
If making these probes configurable adds substantial delay, then we can ship them hard-coded with only the `enabled` option, and split the configurability off into a separate effort.
Phase 3: Observing recovery in critical probes
With flow metrics, it is possible to differentiate active-critical situations from ones in active recovery. For example, a PQ having net growth over the last 15 minutes may be a critical situation, but if we observe that we also have net shrink over the last 5 minutes the situation isn't as dire, so it (a) shouldn't push the indicator into the red and (b) is capable of producing different diagnostic output.
At a future point we can add the concept of `recovery` to the flow-based probe prototype. When a probe tests positive for `critical`, we could also test its `recovery` to present an appropriate result.
```yaml
pipeline.health.probe.queue_persisted_growth_events:
  enabled: auto   # true if queue.type == persisted
  degraded: last_5_minutes > 0
  critical: last_15_minutes > 0
  recovery: last_1_minute <= 0
```
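A minimal sketch (hypothetical names) of how a flow-based probe could fold the `recovery` test into its result, so that an observed recovery keeps the indicator out of the red while still allowing a distinct diagnosis:

```ruby
# Sketch: a tripped `critical` check that also satisfies `recovery` is reported
# as degraded-and-recovering rather than critical.
def probe_status(critical:, degraded:, recovery:)
  if critical
    recovery ? "yellow" : "red" # recovering, so don't push the indicator into the red
  elsif degraded
    "yellow"
  else
    "green"
  end
end

# PQ shows net growth over 15 minutes but net shrink over the last minute:
probe_status(critical: true, degraded: true, recovery: true) # => "yellow"
```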
Really interesting and complete proposal. I have only one concern: the `impact.severity` field is defined as an integer, and from its description:

> How important this impact is to functionality. A value of 1 is the highest severity, with larger values indicating lower severity.

As a user I would expect that severity follows the natural order, so 1 is low and 10 is high. I also think that, to provide better information, we have to cap the max value, otherwise a user could ask "is 10 severe enough, or do we also have 100?". In this case, having the severity in a range, say 1..10, could provide a better measure; I left out 0 intentionally, because 0 means "no problem", so there is no need for the field to exist in that case.
> As a user I would expect that severity follows the natural order, so 1 is low and 10 is high.

I too was surprised by this, but this is as-defined and implemented in Elasticsearch's corresponding endpoint, and corresponds to the severity in support systems where a "Sev 1" is of utmost importance. The goal here is to follow their prior art wherever possible.
This is a thoughtful proposal. I have two questions regarding the schema.

- `has_status`. I can see `status` always has a value from `unknown` to `green`. Do you think we can remove the layer `has_status`, so `status` pops up to the upper level?
- `diagnosis.affected_resources` vs `impact.impact_areas`. Can you tell me what is the difference? They sound like a similar thing.
It really shows the amount of effort put into this, it's much appreciated.

a) It'd be nice to have a sample (manually constructed) JSON response from this API to help visualize the benefit of this API.

b) A concern I have is that the wider and more in-depth the indicator tree grows, the more sensitive the top-level status becomes to small perturbations. A suggestion is that we could allow the incubation of indicators. These would be visible in the health report but not yet bubble up their contribution to the parent indicator. A way to implement this would be to mark its impact to the `impact_areas` as 0.

c)
> probes themselves aren't exposed via the API; rather, they are the internal component that can add a diagnosis and impacts to the indicator and degrade its status.

Not even behind a `?show_probes=true` flag? :) Maybe with a few examples of the API's return it becomes clearer that it's not necessary, but without the probe values I wonder if users will question the overall report.
This is awesome, and I want to prioritize it as soon as possible. For Phase 1, do you have a high-level sense of the amount of development time required?

I really like relying on the prior art and being consistent with the rest of the stack. Looking at the Elasticsearch health API, I would like to verify that our indicators are consistent with that approach: https://www.elastic.co/guide/en/elasticsearch/reference/current/health-api.html

Example ES indicator - `shards_availability` {unassigned_primaries, initializing_primaries, creating_primaries...}
Example LS indicator - `resources` {memory_pressure}

I think memory pressure is more similar to the ES indicator level than resources, and it will give us more options for properties of memory pressure. I understand the desire to have pipeline-level information, but a more appropriate indicator at the LS health API level might be `pipeline_flow`, with individual pipelines as properties and pipeline issues as properties.
> `has_status`. I can see `status` always has a value from `unknown` to `green`. Do you think we can remove the layer `has_status`, so `status` pops up to the upper level?

I was a little split on this, but ended up breaking `has_status` (and `has_indicators`, etc.) off to reduce repetition and to reduce the scope of what a reader of the schema needs to be aware of at each level. Each of the `has_*` sub-schemas validates both the key and the shape of the value, which I found to be more straight-forward than validating the key in multiple separate places and using the `$ref` only for the value. For this one in particular we could also just use an `enum`, but I wanted to be able to include a description for each of the values.
> `diagnosis.affected_resources` vs `impact.impact_areas`. Can you tell me what is the difference? They sound like a similar thing.

This is directly pulled from the Elasticsearch health report's schema, and the difference is still a little vague to me.

In my mental model, a `diagnosis` tells you about something that isn't right, gives you steps to remediate it, and indicates what areas are affected, but an `impact` tells you how your processing is affected by the thing that is wrong. I'm not entirely sure why they are decoupled in the Elasticsearch endpoint, but I don't have a strong reason to break from their prior art.
> a) It'd be nice to have a sample (manually constructed) JSON response from this API to help visualize the benefit of this API.

I've added one, immediately below the schema.
> b) A concern I have is that the wider and more in-depth the indicator tree grows, the more sensitive the top-level status becomes to small perturbations. A suggestion is that we could allow the incubation of indicators. These would be visible in the health report but not yet bubble up their contribution to the parent indicator. A way to implement this would be to mark its impact to the `impact_areas` as 0.

I see the concern, and think we can handle this in two ways:

- enable a probe to include a `diagnosis` without degrading the status of the indicator, so that we can introduce new "advisory" or "incubator" probes
- enable the acking of an indicator, a probe-type, or a specific probe via an API endpoint, preventing it from contributing to the status either forever or until a TTL has passed (but not preventing it from supplying a diagnosis).
> c)
> > probes themselves aren't exposed via the API; rather, they are the internal component that can add a diagnosis and impacts to the indicator and degrade its status.
>
> Not even behind a `?show_probes=true` flag? :) Maybe with a few examples of the API's return it becomes clearer that it's not necessary, but without the probe values I wonder if users will question the overall report.

I hope that this is not needed, but I believe the implementation can defer it until it is needed. One way would be to have the `details` (which is a free-form key/value map) include the probes by their type and their effect on the status in order to add confidence. The Elasticsearch indicators each have a variety of `details` that they present specific to what is being observed, and we can define the specifics of our `details` for each top-level indicator or for the pipeline-indicator type.
> I think memory pressure is more similar to the ES indicator level than resources, and it will give us more options for properties of memory pressure. I understand the desire to have pipeline-level information, but a more appropriate indicator at the LS health API level might be `pipeline_flow`, with individual pipelines as properties and pipeline issues as properties.

We can have many probes that contribute to degrading the status of the `resources` indicator, including multiple ones that expose different types of memory pressure (such as tracking new metrics like humongous allocations). Perhaps the phase-1 too-much-time-collecting-garbage probe would be better named. I avoided breaking the top-level `resources` into component parts `memory`, `disk`, `cpu`, etc., because grouping them gives us the freedom to develop probes that view these resources collectively in the future without breaking the shape of the API response.

The Elasticsearch indicators map to either whole cluster-wide systems (`master_is_stable`, `ilm`, `slm`, `repository_integrity`, or `shards_capacity`) or cluster-wide resource availability (`disk`), and we don't have direct analogues to these things in Logstash. The `details` properties feature prominently in the Elasticsearch API documentation, but where the API really shines is the `diagnosis`, which are fed by a variety of probes.

In shaping the top-level indicators to be `resources` and `pipelines`, my goal is to let a user of the API know (1) this process has the resources it needs and (2) its pipelines are processing events as expected, and when one or more pipelines isn't processing as expected, they can drill down into the specific pipeline before being flooded with diagnosis and guidance.
This is awesome work, and I can see a lot of utility here, some potential uses for our dashboards, and maybe even autoscaling in ECK.

- Would you anticipate being able to get subsets of the health report, e.g. `GET /_health_report/pipelines/<pipeline_id>` for individual pipelines?
- Would you anticipate having a "verbose/minimized" feature to reduce payloads/weight for clients that poll just looking for status?
- Would it be possible to turn off status propagation altogether at the pipeline level for non-critical pipelines? I am thinking about configurations that we see in production, where there are a huge number of pipelines, some of which are either low-traffic or QA, etc.
- Would we consider a 'source' or similar field to determine the source probe for the degradation number? Or are the probes an "implementation detail"?
- Probably outside of the scope of this issue, but I'd love to get the `memory_pressure` numbers available, either here or in the general metrics API.
> Would you anticipate being able to get subsets of the health report, e.g. `GET /_health_report/pipelines/<pipeline_id>` for individual pipelines?

Absolutely. I believe that since `pipelines` is an indicator containing sub-indicators named after their respective pipeline ids, what you have suggested perfectly extends the pattern already present in Elasticsearch's comparable API.
> Would you anticipate having a "verbose/minimized" feature to reduce payloads/weight for clients that poll just looking for status?

Yes. Elasticsearch's comparable API uses `?verbose=false` for this effect, and I think it makes sense here too, especially if we either cache the calculations or otherwise calculate the health status out-of-band from the HTTP API request (e.g., periodically running each probe).

Worth noting, Elasticsearch's version of this is also documented as a way of avoiding calculating the `details` or `diagnosis` to reduce computation cost, so we will need to set clear expectations here.
> Would it be possible to turn off status propagation altogether at the pipeline level for non-critical pipelines? I am thinking about configurations that we see in production, where there are a huge number of pipelines, some of which are either low-traffic or QA, etc.

This dove-tails into some of @jsvd's concerns, and I think we have a couple of options on the table that we can decide on if-and-when we get closer to seeing how it plays out:

- acking (with optional TTL) an indicator or sub-indicator (possibly scoped to a specific diagnosis) to prevent a known issue from propagating.
- introducing a pipeline-level setting to either directly control propagation (`pipeline.health.propagate`: `true`/`false`), or to indirectly control it by flagging the role of the pipeline (I don't have verbiage yet, but something that succinctly differentiates whether they want or need the pipeline to be healthy in order for the `pipelines` indicator to be considered healthy).
> Would we consider a 'source' or similar field to determine the source probe for the degradation number? Or are the probes an "implementation detail"?

I think that the probes themselves are an implementation detail, but that they should be able to produce a diagnosis that leads us back to the specifics.

The Elasticsearch API's `diagnosis` is observed to have an `id` that is descriptive enough to serve this purpose (such as `elasticsearch:health:shards_availability:diagnosis:increase_tier_capacity_for_allocations:tier:data_content`).
> Probably outside of the scope of this issue, but I'd love to get the `memory_pressure` numbers available, either here or in the general metrics API.

Absolutely.

The existing metrics API can be extremely verbose, and I am wary of continuing to add general noise to it without a plan to also deprecate some of the noise that is already there. I imagine these numbers becoming available as part of the `resources` indicator's `details` key/value map, and/or within the `diagnosis` that is presented by a probe. I could also see an option to include healthy diagnosis info so that we can add confidence to why we consider a healthy indicator to be healthy.