Collect APM metrics for failure stores
Collecting metrics for the number of failures during both ingest and the bulk operation (regardless of whether the failure store is enabled/applicable), as well as for redirects to the failure store, will give us and users some insight into the failure counts/rates of specific indices/data streams.
Pinging @elastic/es-data-management (Team:Data Management)
@gmarouli I could see value in users getting insights into which data streams see a lot of failures and could thus benefit from a failure store. I also don't immediately see value in tracking failure counts for non-data stream indices, so it probably makes sense to just track the failures for data streams only. Maybe @jbaiera can elaborate on this?
I think it would be important to capture the following information:
- Counter of rejected docs per index
- Counter of documents that have landed in the failure store
- Information about the category of the failure (like the exception type), and whether it happened in the pipeline vs. being a mapping issue. Maybe that's the same thing.
- Counter of documents that were indexed but are degraded, in the sense that they have an `_ignored` field
- Counter of total documents that were requested to be indexed per index, to allow calculating the percentage of failed documents
One doubt I have is whether the index name may be a high-cardinality attribute, so that we'd have to track lots of different metrics in memory if there are a lot of indices. But since we have per-index metadata in memory anyway, it may not be a big issue. Maybe we need to take care that we only track these metrics on the node that holds a primary shard for an index.
Also cc @ruflin @flash1293 to provide input on which ingestion/failure store metrics are relevant.
@felixbarny We discussed this topic last week with @jbaiera in the ES+Obs sync and concluded that, for the time being, we can count the docs by looking at index stats and grouping them in a certain way as part of a background job running in Kibana. This is how we are collecting data telemetry already, so it would require the least amount of changes, as it's mostly an extension of the existing setup. It also has the additional advantage that it would work in all environments, including on-prem.
A similar thing is discussed here for the `_ignored` field: https://github.com/elastic/elasticsearch/issues/108092
These two things together would cover most of the telemetry needs from Observability we know about right now. This document is laying out our plans: https://docs.google.com/document/d/1Qy8DlsJTVfi8zDvZMAkOxjvh6zsXFJRJYR_aRc7eGLQ/edit#heading=h.1qaayw9lp75l
This isn't meant to block these efforts; it's just an FYI about the immediate plans on that front.
One tradeoff to be aware of is that we then can't track the number and ratio of rejected docs as they'll never make it into the system.
Counter of rejected docs per index
@felixbarny Just want to clarify: do you mean documents that have failed and whose failures will not be collected in the failure store?
There are some cases where a failure would be directed to the failure store, but due to availability issues it may be returned to the user with an error instead. Should we track those documents as well?
Counter of documents that were indexed but are degraded, in the sense that it has an _ignored field
I don't think we'll be able to collect these stats from where we're focusing in this PR - this will have to be done closer to the mapping code since the documents are accepted without any notification to the client that they have degraded.
One doubt that I have is whether the index name may be a high cardinality attribute so that we have to track lots of different metrics in memory if there are a lot of indices.
I checked with core infra a while back about the metrics collection and how it handles high cardinality. Their read on the situation is that the cardinality shouldn't be an issue as long as it's not part of the counter name and only part of the attributes on the counter when incrementing it.
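To illustrate that point, here's a minimal sketch (not from this PR) of a counter where the per-index/data-stream name only shows up as an attribute at increment time, never in the metric name itself. The `MeterRegistry`/`LongCounter` shapes are assumed from the server telemetry module, and the metric/attribute names are placeholders.

```java
// Minimal sketch: the counter name stays fixed, cardinality lives in the attributes.
// MeterRegistry/LongCounter shapes are assumed; names are illustrative only.
import java.util.Map;

import org.elasticsearch.telemetry.metric.LongCounter;
import org.elasticsearch.telemetry.metric.MeterRegistry;

class RejectedDocsMetricsSketch {
    private final LongCounter rejectedDocs;

    RejectedDocsMetricsSketch(MeterRegistry meterRegistry) {
        // One fixed counter name, regardless of how many indices/data streams exist.
        this.rejectedDocs = meterRegistry.registerLongCounter(
            "es.data_stream.ingest.documents.rejected",
            "Documents rejected during ingest",
            "documents"
        );
    }

    void onRejectedDoc(String dataStreamName) {
        // The high-cardinality value is only carried as an attribute on each increment.
        rejectedDocs.incrementBy(1, Map.of("data_stream_name", dataStreamName));
    }
}
```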
Maybe we need to take care that we only track these metrics on the node that holds a primary shard for an index.
Most of these stats will need to be collected when handling the bulk request on the coordinating node, and since any node can coordinate a bulk operation it's unlikely we'll be able to constrain the collection to the primary shards 😞
Counter of total documents that were requested to be indexed per index, to allow calculating the percentage of failed documents
Do we want to track this for all data streams, regardless of whether they have failure stores enabled? If we collect stats about data streams without the feature enabled, the metric name should be more generic; if we only care about totals for data streams with the failure store enabled, it should be a more specific metric name.
@felixbarny Just to make sure we're aligned on the possible counters for this PR, here are the things that I think we could reasonably track right now, along with some suggested names for the metrics. Please chime in on the metric name suggestions, because they likely need to be refined still.
Counters
- Total documents sent to a data stream
  - All data streams or just those with failure store enabled?
  - Suggestion: `es.data_stream.ingest.documents.total` if we want all data streams regardless of whether the failure store is enabled
- Rejected documents
  - Suggestion: `es.data_stream.ingest.documents.rejected`
- Failures that could have been stored but won't be because the failure store is disabled (?)
  - Suggestion: `es.data_stream.ingest.documents.failed`, maybe with the `failure_store_enabled` attribute set to false?
- Documents sent to the failure store
  - Suggestion: `es.data_stream.ingest.documents.failed`
- Documents rejected from the failure store (?)
  - Suggestion: `es.data_stream.ingest.documents.failure_store.rejected`
Attributes to trace
- Data stream name (`data_stream_name`)
- Exception Type (`error_type`)
- Pipeline vs Mapping issue (`error_location` (?))
Out of scope:
- Count of degraded documents (`_ignored` field) - This isn't really failure store related; this data is technically successfully indexed. We'll need to track this somewhere else.
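To make the proposal above a bit more concrete, here is a rough sketch of how the suggested counters and attributes could be wired up and bumped from one place once a document's ingest outcome is known. The class, enum, and method names are invented for illustration, and the registration/increment API is assumed to follow the server telemetry module's `MeterRegistry`/`LongCounter`.

```java
// Hedged sketch only: class/enum/helper names are invented for illustration,
// and the MeterRegistry/LongCounter shapes are assumed from the telemetry module.
import java.util.HashMap;
import java.util.Map;

import org.elasticsearch.telemetry.metric.LongCounter;
import org.elasticsearch.telemetry.metric.MeterRegistry;

class DataStreamIngestMetricsSketch {

    enum Outcome { INDEXED, SENT_TO_FAILURE_STORE, REJECTED, REJECTED_FROM_FAILURE_STORE }

    private final LongCounter total;
    private final LongCounter failed;
    private final LongCounter rejected;
    private final LongCounter failureStoreRejected;

    DataStreamIngestMetricsSketch(MeterRegistry registry) {
        this.total = registry.registerLongCounter(
            "es.data_stream.ingest.documents.total", "Documents sent to a data stream", "documents");
        this.failed = registry.registerLongCounter(
            "es.data_stream.ingest.documents.failed", "Documents redirected to the failure store", "documents");
        this.rejected = registry.registerLongCounter(
            "es.data_stream.ingest.documents.rejected", "Documents rejected during ingest", "documents");
        this.failureStoreRejected = registry.registerLongCounter(
            "es.data_stream.ingest.documents.failure_store.rejected", "Documents rejected from the failure store", "documents");
    }

    void record(String dataStreamName, Outcome outcome, boolean failureStoreEnabled, String errorType, String errorLocation) {
        Map<String, Object> attrs = new HashMap<>();
        attrs.put("data_stream_name", dataStreamName);
        attrs.put("failure_store_enabled", failureStoreEnabled);
        if (errorType != null) {
            attrs.put("error_type", errorType);         // e.g. the exception class name
        }
        if (errorLocation != null) {
            attrs.put("error_location", errorLocation); // e.g. "pipeline" vs "mapping"
        }
        total.incrementBy(1, attrs);                    // every document counts towards the total
        switch (outcome) {
            case INDEXED -> { /* nothing else to record */ }
            case SENT_TO_FAILURE_STORE -> failed.incrementBy(1, attrs);
            case REJECTED -> rejected.incrementBy(1, attrs);
            case REJECTED_FROM_FAILURE_STORE -> failureStoreRejected.incrementBy(1, attrs);
        }
    }
}
```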
Counter of rejected docs per index
@felixbarny Just want to clarify: do you mean documents that have failed and whose failures will not be collected in the failure store?
Yes, that's what I meant. The intention behind that is that we can find out how well we're tracking against the "accept all logs" principle. When the failure store is enabled, we'd expect the ratio of rejected documents to go down significantly. But the failure store isn't enabled by default - at least in the beginning. It can also help to identify leaks where the failure store was supposed to catch a failure but didn't for some reason.
Counter of documents that were indexed but are degraded, in the sense that it has an _ignored field
I don't think we'll be able to collect these stats from where we're focusing in this PR
Agree that this would be outside the scope of this PR. At the same time, I wanted to look at the problem more holistically in terms of tracking ingestion issues vs just tracking failure store metrics. Tracking the failure store is part of it but it's part of a bigger story where the pieces should eventually fit together (like consistent naming, being able to create a single dashboard for the different metrics, etc.)
cardinality shouldn't be an issue as long as it's not part of the counter name and only part of the attributes on the counter when incrementing it.
While I agree that the index name should be an attribute and not part of the metric name, from a memory and storage perspective, it doesn't really matter if the cardinality is encoded in an attribute vs the name.
Do we want to track this for all data streams regardless of if they have failure stores enabled?
I think that would make sense but we should discuss this in some more detail. We should also align with the telemetry efforts @flash1293 is working on and think about who owns these metrics and which thresholds we want to set for ourselves.
Failures that could have been stored but won't be because the failure store is disabled (?)
I'd expect those to be part of the `.rejected` metric. Maybe that metric could have an attribute to indicate whether the failure store would have caught it.
- Documents sent to the failure store
  - Suggestion: `es.data_stream.ingest.documents.failed`

I think `failed` sounds a bit ambiguous and it's not intuitively clear how it differs from `rejected`. Maybe `es.data_stream.ingest.documents.failure_store` would be a little clearer. I don't think it's important that all metrics end with a past-tense verb. If you think otherwise, maybe `*.failure_stored`?
- Documents rejected from the failure store (?)
  - Suggestion: `es.data_stream.ingest.documents.failure_store.rejected`

I'd also use the `es.data_stream.ingest.documents.rejected` metric for that, potentially with a separate attribute that disambiguates regular rejections from those that the failure store rejected.

In general, it should be possible to get the failure rate by doing `es.data_stream.ingest.documents.rejected / es.data_stream.ingest.documents.total`. Therefore, all rejected documents should be included in `es.data_stream.ingest.documents.rejected`, no matter whether they could have been stored in the failure store or whether the failure store rejected them.
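As a tiny worked example of that ratio (the numbers here are made up):

```java
// Made-up numbers, just to illustrate the failure-rate calculation from the two counters.
long totalDocs = 10_000;  // es.data_stream.ingest.documents.total
long rejectedDocs = 150;  // es.data_stream.ingest.documents.rejected (all rejections, regardless of cause)
double failureRate = (double) rejectedDocs / totalDocs;  // 0.015, i.e. 1.5%
```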
Count of degraded documents (`_ignored` field)
Agree that this is out of scope for now. In the future we could just add another metric like `es.data_stream.ingest.documents.degraded`.
cc @elastic/es-storage-engine: this discussion is probably also relevant for you, as it helps track our progress towards the goal of accepting all logs. It can help us identify areas where logs may still get dropped, or get added to the failure store unnecessarily, so that we can prioritize efforts to minimize data loss or to avoid falling back on the failure store when there are other alternatives.
After several iterations of internal discussion, I think we reached a solution (proposal) that works for everyone. I just pushed a commit that should align the implementation with the discussed proposal. I don't expect the observability folks to look at the new changes themselves - as those are just implementation details of the proposal (i.e. feel free to unsubscribe from this PR :) ). I'll update the description of this PR as well to reflect the discussed proposal.
Hit "request changes" before actually typing my review text 🙄
I think the changes are looking good; I mostly have some concerns around the refactoring in the ingest service. I wonder if we could increase the test coverage to include situations like updating the index field and using the reroute processor.