Deprecation file logs should not emit data_stream.dataset

Open pgomulka opened this issue 3 years ago • 13 comments

Elasticsearch Version

8.4.3

Installed Plugins

No response

Java Version

bundled

OS Version

macos

Problem Description

As per ECS, data_stream.dataset is a constant_keyword: https://www.elastic.co/guide/en/ecs/master/ecs-data_stream.html#field-data-stream-dataset. As per the constant_keyword definition, this means the first indexed document that defines this field sets its value for all remaining documents, and documents with a different value are rejected.
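The constant_keyword behaviour described above can be illustrated with a toy model. This is only a sketch of the semantics, not Elasticsearch code; the class name `ConstantKeywordField` and its `index` method are hypothetical names chosen for illustration.

```python
class ConstantKeywordField:
    """Toy model of one constant_keyword field in one index (illustrative only)."""

    def __init__(self, value=None):
        # None means the value has not been fixed yet, neither by the
        # mapping nor by a previously indexed document.
        self.value = value

    def index(self, doc_value):
        """Return the value the document ends up with, or raise if it conflicts."""
        if doc_value is None:
            # A document that omits the field still reports the constant value.
            return self.value
        if self.value is None:
            # The first document that defines the field fixes the value.
            self.value = doc_value
            return self.value
        if doc_value != self.value:
            raise ValueError(
                f"constant_keyword accepts only [{self.value}], got [{doc_value}]"
            )
        return self.value


field = ConstantKeywordField()
field.index("deprecation.elasticsearch")  # fixes the value for the whole index
print(field.index(None))                  # later documents inherit that value
```

In this model, a deprecation log line that reaches a shared index first determines what every later document without the field appears to contain.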

As a consequence, when a user does stack monitoring with Filebeat, indexing to the default filebeat-x.y.z data stream, any indexed document (coming from stack monitoring or any other Filebeat module) ends up with "data_stream.dataset":"deprecation.elasticsearch":

We should not emit data_stream.dataset in the deprecation.json log file. data_stream.dataset should still be emitted into .logs-deprecation.elasticsearch-default when cluster.deprecation_indexing.enabled: true (the default).

The documentation of the data_stream.dataset field also indicates that it should have the same value as event.dataset. This makes me wonder whether logs ingested into an ES cluster should go into individual data streams, one per data_stream.dataset value.

Steps to Reproduce

emit deprecation logs

Logs (if relevant)

No response

pgomulka avatar Nov 22 '22 09:11 pgomulka

Pinging @elastic/es-core-infra (Team:Core/Infra)

elasticsearchmachine avatar Nov 22 '22 09:11 elasticsearchmachine

note - this is possibly a breaking change, similar to https://github.com/elastic/elasticsearch/issues/83251

pgomulka avatar Nov 22 '22 09:11 pgomulka

Or should the presence of the data_stream.dataset field in logs indicate that they should be sent into a dedicated data stream?

pgomulka avatar Nov 22 '22 15:11 pgomulka

data_stream.dataset should only be set as constant_keyword when the data stream naming scheme is used. In the scenario of filebeat-* data streams, data_stream.* fields must be set as keyword to not have a conflict. We should have a look at the template that filebeat sets up and if data_stream.* in there is set as constant_keyword, it shouldn't.

The above should solve it for filebeat. But there is still a chance the problem shows up for the data stream naming scheme if the data comes out of a firehose output. Document-based routing will solve this problem. There is even an ES PR, https://github.com/elastic/elasticsearch/pull/76511, that would solve it at least for the data_stream.* fields.
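To make the idea of document-based routing concrete, here is a minimal sketch of how a target data stream name could be derived from a document's data_stream.* fields under the `{type}-{dataset}-{namespace}` naming scheme. The function name `target_data_stream` and the fallback values are assumptions for illustration only; this is not the logic of the linked PR.

```python
def target_data_stream(doc, default_type="logs",
                       default_dataset="generic",
                       default_namespace="default"):
    """Build '{type}-{dataset}-{namespace}' from a document's data_stream object.

    Hypothetical helper: expects nested fields (doc["data_stream"]["dataset"]),
    and the fallback values are assumptions, not Elasticsearch behaviour.
    """
    ds = doc.get("data_stream", {})
    parts = (
        ds.get("type", default_type),
        ds.get("dataset", default_dataset),
        ds.get("namespace", default_namespace),
    )
    return "-".join(parts)


deprecation_event = {
    "message": "some deprecation warning",
    "data_stream": {"dataset": "deprecation.elasticsearch"},
}
print(target_data_stream(deprecation_event))
```

With routing like this, each dataset value lands in its own data stream, so a constant_keyword mapping never sees two different values in one index.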

ruflin avatar Nov 23 '22 15:11 ruflin

So this sounds like a specific case of "event contains a field where the "type" of the field is in conflict with ECS". Much like if the event contained geo.location but that field wasn't a geo_point. Normally I think we say "don't do that", but I see the point @ruflin is making. I'll do some digging into the templates, but I do think we should answer why setting any of the data_stream.* fields is a valid thing to do if you don't know if you are writing to a data_stream.

leehinman avatar Nov 23 '22 19:11 leehinman

The fields in Beats are set here: https://github.com/elastic/beats/blob/main/libbeat/_meta/fields.ecs.yml#L894 As filebeat always ships to a single data stream, this must be changed to keyword instead of constant_keyword.

we should answer why setting any of the data_stream.* fields is a valid thing to do if you don't know if you are writing to a data_stream.

We recommend using ECS fields in our logs. Elasticsearch does this, which is great. The service that writes the logs to disk cannot know who picks up the logs. If it were Elastic Agent, this would work as expected. Because it is Filebeat, it doesn't. My take is that what Elasticsearch does in this scenario is correct, and the culprit is Filebeat, because Filebeat sets a field type that is not compatible with the way Filebeat ingests data, as it does not follow the data stream naming scheme.

ruflin avatar Nov 24 '22 07:11 ruflin

If it would be Elastic Agent, this would work as expected.

I'm not entirely convinced this is a true statement. What if the custom log integration was used? The data_stream.dataset would be constant keyword, but the value would be set by the integration not the value in the Elasticsearch log.

@ebeahan do you have any input on which entities should set data_stream.dataset and on expectations around constant_keyword?

But I do agree that Filebeat shouldn't be setting the field type to constant_keyword if it doesn't make sense for the index. I think we would need to look at the index name during setup and modify the template before we send it up. Is the heuristic "if the index name isn't logs-*-* or metrics-*-*, then data_stream.dataset is keyword" sufficient? We could change the fields.ecs.yml file but then that would cause problems if the user configured filebeat to try and use the data stream naming scheme by hand. I'd really like to limit unintended consequences.
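The heuristic in question could look roughly like the sketch below. It is not Beats code: the function name `dataset_field_type` and the idea of calling it during template setup are assumptions for illustration.

```python
from fnmatch import fnmatchcase

# Index name patterns that follow the data stream naming scheme,
# as named in the heuristic above.
DATA_STREAM_PATTERNS = ("logs-*-*", "metrics-*-*")


def dataset_field_type(index_name):
    """Pick the mapping type for data_stream.dataset from the index name."""
    if any(fnmatchcase(index_name, pattern) for pattern in DATA_STREAM_PATTERNS):
        return "constant_keyword"
    # Anything else (for example filebeat-x.y.z) may mix several datasets.
    return "keyword"


for name in ("filebeat-8.4.3", "logs-deprecation.elasticsearch-default"):
    print(name, "->", dataset_field_type(name))
```

A name-based check like this keeps constant_keyword for users who configure the data stream naming scheme by hand, at the cost of guessing intent from the index name alone.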

leehinman avatar Dec 12 '22 21:12 leehinman

I'm not entirely convinced this is a true statement. What if the custom log integration was used? The data_stream.dataset would be constant keyword, but the value would be set by the integration not the value in the Elasticsearch log.

This is correct if the custom log integration is used, but I don't think this scenario applies here? For cases where the data_stream.* fields are set in the log file and custom log is used, we need https://github.com/elastic/elasticsearch/issues/63798. APM logs can hit this scenario, which is why https://github.com/elastic/elasticsearch/issues/63798 was opened.

But I do agree that Filebeat shouldn't be setting the field type to constant_keyword if it doesn't make sense for the index. I think we would need to look at the index name during setup and modify the template before we send it up. Is the heuristic "if the index name isn't logs-*-* or metrics-*-*, then data_stream.dataset is keyword" sufficient? We could change the fields.ecs.yml file but then that would cause problems if the user configured filebeat to try and use the data stream naming scheme by hand. I'd really like to limit unintended consequences.

Maybe we can turn it around. Make data_stream.* a keyword by default but allow it to be constant_keyword in the context of the data stream naming scheme?

ruflin avatar Dec 13 '22 09:12 ruflin

Just noticed our kibana.audit dataset contains

data_stream.dataset : deprecation.elasticsearch

Possibly related to this issue. Very confusing situation imho. https://github.com/elastic/elasticsearch/issues/83251 seems related.

willemdh avatar Aug 01 '23 12:08 willemdh

@pgomulka is going to look into this to gather context so we can discuss

mosche avatar Jun 06 '24 16:06 mosche

[image]

Looks like the issue is still valid. It is not necessarily an ES issue, but more like a 'logging shipment infrastructure' problem, I think. data_stream.dataset is still constant_keyword: https://github.com/elastic/beats/blob/main/libbeat/_meta/fields.ecs.yml#L916

@cmacknz or @leehinman do you think this can be fixed on elastic agent side?

pgomulka avatar Jun 07 '24 09:06 pgomulka

Reading the discussion here fixing this mapping in libbeat seems like the correct path. CC @pierrehilbert this would fall into your team's area.

cmacknz avatar Jun 10 '24 14:06 cmacknz

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

elasticmachine avatar Jul 10 '24 17:07 elasticmachine

This bug caused considerable confusion in my organization this month, as we wasted considerable resources attempting to determine what was wrong with our Filebeat event processing flows, only to discover that it was an index template issue all along. You have identified what needs to be done (change the type of the key data_stream.dataset from constant_keyword to keyword). We would appreciate seeing that change applied in the Beats codebase so that we don't have to keep modifying the index templates with each new Beats update in our environment.
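For readers in the same situation, the mapping change itself can be sketched as a template fragment in the shape a component template body takes. The function name `dataset_keyword_override` is hypothetical, and how the fragment gets attached to the Filebeat index template in a given environment is an assumption left out here.

```python
import json


def dataset_keyword_override():
    """Return a mappings fragment that maps data_stream.dataset as keyword.

    Illustrative only: wiring this into an existing index template
    (template names, priorities, composition) depends on the environment.
    """
    return {
        "template": {
            "mappings": {
                "properties": {
                    "data_stream": {
                        "properties": {
                            "dataset": {"type": "keyword"}
                        }
                    }
                }
            }
        }
    }


print(json.dumps(dataset_keyword_override(), indent=2))
```

A changed template only affects backing indices created after the change, so an existing data stream would likely need a rollover before the new mapping takes effect.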

jgregmac avatar May 09 '25 12:05 jgregmac