Deprecation file logs should not emit data_stream.dataset
Elasticsearch Version
8.4.3
Installed Plugins
No response
Java Version
bundled
OS Version
macos
Problem Description
As per ECS, data_stream.dataset is a constant_keyword: https://www.elastic.co/guide/en/ecs/master/ecs-data_stream.html#field-data-stream-dataset. As per the constant_keyword definition, this means that the first indexed document that defines this field sets its value for all remaining documents, and documents with a different value are rejected.
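For readers less familiar with constant_keyword, the behaviour described above can be modelled with a small Python toy. This is only a sketch of the semantics, not Elasticsearch code; the class `ConstantKeywordField` and its methods are hypothetical names made up for illustration.

```python
class ConstantKeywordField:
    """Toy model of a constant_keyword field (illustrative only)."""

    def __init__(self, value=None):
        # None means the mapping has no value configured yet.
        self.value = value

    def index(self, doc_value):
        """Return True if a document with this field value is accepted."""
        # Documents that omit the field are always accepted.
        if doc_value is None:
            return True
        # The first document that carries a value fixes it for the index.
        if self.value is None:
            self.value = doc_value
            return True
        # Later documents must carry the same value, otherwise they are rejected.
        return doc_value == self.value

    def search_value(self):
        """Every document in the index appears to have this value."""
        return self.value


field = ConstantKeywordField()
accepted_first = field.index("deprecation.elasticsearch")  # fixes the value
accepted_missing = field.index(None)                       # field omitted
accepted_other = field.index("nginx.access")               # different value
```

In this model a deprecation log line that arrives first fixes the value, so documents from other modules that omit the field still report `deprecation.elasticsearch` when searched.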
As a consequence, when a user does stack monitoring with Filebeat and indexes into the default filebeat-x.y.z data stream, every indexed document (whether it comes from stack monitoring or from any other Filebeat module) ends up with "data_stream.dataset":"deprecation.elasticsearch".
We should not emit data_stream.dataset in the deprecation.json log file.
data_stream.dataset should still be emitted into .logs-deprecation.elasticsearch-default when cluster.deprecation_indexing.enabled: true (default).
The documentation of the data_stream.dataset field also indicates that it should have the same value as event.dataset. This makes me wonder whether logs ingested into an ES cluster should go into individual data streams, one per data_stream.dataset value.
Steps to Reproduce
emit deprecation logs
Logs (if relevant)
No response
Pinging @elastic/es-core-infra (Team:Core/Infra)
note - this is possibly a breaking change, similar to https://github.com/elastic/elasticsearch/issues/83251
Or should the presence of the data_stream.dataset field in logs indicate that they should be sent into a dedicated data stream?
data_stream.dataset should only be set as constant_keyword when the data stream naming scheme is used. In the scenario of filebeat-* data streams, data_stream.* fields must be set as keyword to avoid a conflict. We should have a look at the template that Filebeat sets up: if data_stream.* in there is set as constant_keyword, it shouldn't be.
The above should solve it for Filebeat. But there is still a chance that the same problem shows up with the data stream naming scheme if the data comes out of a firehose output. Document-based routing will solve this problem. There is even an ES PR, https://github.com/elastic/elasticsearch/pull/76511, that would solve it at least for the data_stream.* fields.
So this sounds like a specific case of "event contains a field whose type is in conflict with ECS", much like if the event contained geo.location but that field wasn't a geo_point. Normally I think we say "don't do that", but I see the point @ruflin is making. I'll do some digging into the templates, but I do think we should answer why setting any of the data_stream.* fields is a valid thing to do if you don't know if you are writing to a data_stream.
The fields in Beats are set here: https://github.com/elastic/beats/blob/main/libbeat/_meta/fields.ecs.yml#L894 As filebeat always ships to a single data stream, this must be changed to keyword instead of constant_keyword.
we should answer why setting any of the data_stream.* fields is a valid thing to do if you don't know if you are writing to a data_stream.
We recommend using ECS fields in our logs. Elasticsearch does this, which is great. The service that writes the logs to disk cannot know who picks up the logs. If it would be Elastic Agent, this would work as expected. Because it is Filebeat, it doesn't. My take is that what Elasticsearch does in this scenario is correct and the culprit is Filebeat: it maps the field with a type that is incompatible with the way Filebeat ingests data, since Filebeat does not follow the data stream naming scheme.
If it would be Elastic Agent, this would work as expected.
I'm not entirely convinced this is a true statement. What if the custom log integration was used? The data_stream.dataset would be constant keyword, but the value would be set by the integration not the value in the Elasticsearch log.
@ebeahan do you have any input on which entities should set data_stream.dataset and on the expectations around constant_keyword?
But I do agree that Filebeat shouldn't be setting the field type to constant_keyword if it doesn't make sense for the index. I think we would need to look at the index name during setup and modify the template before we send it up. Is the heuristic "if the index name isn't logs-*-* or metrics-*-*, then data_stream.dataset is keyword" sufficient? We could change the fields.ecs.yml file but then that would cause problems if the user configured filebeat to try and use the data stream naming scheme by hand. I'd really like to limit unintended consequences.
I'm not entirely convinced this is a true statement. What if the custom log integration was used? The data_stream.dataset would be constant keyword, but the value would be set by the integration not the value in the Elasticsearch log.
This is correct if the custom log integration is used, but I don't think that scenario applies here. For cases where the data_stream.* fields are set in the log file and custom log is used, we need https://github.com/elastic/elasticsearch/issues/63798, which was opened because APM logs can hit this scenario.
But I do agree that Filebeat shouldn't be setting the field type to constant_keyword if it doesn't make sense for the index. I think we would need to look at the index name during setup and modify the template before we send it up. Is the heuristic "if the index name isn't logs-*-* or metrics-*-*, then data_stream.dataset is keyword" sufficient? We could change the fields.ecs.yml file but then that would cause problems if the user configured filebeat to try and use the data stream naming scheme by hand. I'd really like to limit unintended consequences.
Maybe we can turn it around. Make data_stream.* a keyword by default but allow it to be constant_keyword in the context of the data stream naming scheme?
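To make the proposed heuristic concrete, here is a minimal Python sketch of "keyword by default, constant_keyword only for the data stream naming scheme". It is not how libbeat builds its templates; the function name `dataset_field_type` and the pattern list are assumptions taken directly from the heuristic quoted above.

```python
from fnmatch import fnmatchcase

# Index name patterns treated as "data stream naming scheme" in the
# heuristic quoted above (an assumption for this sketch, not an official list).
DATA_STREAM_PATTERNS = ("logs-*-*", "metrics-*-*")


def dataset_field_type(index_name):
    """Pick the mapping type for data_stream.dataset from the index name."""
    # constant_keyword only makes sense when each dataset gets its own
    # data stream, i.e. when the name follows <type>-<dataset>-<namespace>.
    if any(fnmatchcase(index_name, pattern) for pattern in DATA_STREAM_PATTERNS):
        return "constant_keyword"
    # Anything else (for example a shared filebeat-* data stream) mixes
    # datasets, so a plain keyword avoids the conflict.
    return "keyword"
```

With this sketch, `dataset_field_type("filebeat-8.4.3")` yields `keyword`, while a name such as `logs-deprecation.elasticsearch-default` yields `constant_keyword`. Whether two name patterns are a sufficient test is exactly the open question in the comment above.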
Just noticed our kibana.audit dataset contains
data_stream.dataset : deprecation.elasticsearch
Possibly related to this issue. Very confusing situation imho. https://github.com/elastic/elasticsearch/issues/83251 seems related.
@pgomulka is going to look into this to gather context so we can discuss
@cmacknz or @leehinman do you think this can be fixed on the Elastic Agent side?
Reading the discussion here, fixing this mapping in libbeat seems like the correct path. CC @pierrehilbert this would fall into your team's area.
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
This bug caused considerable confusion in my organization this month: we wasted considerable resources trying to determine what was wrong with our Filebeat event processing flows, only to discover that it was an index template issue all along. You have identified what needs to be done (change the type of the key data_stream.dataset from constant_keyword to keyword). We would appreciate seeing that change applied in the Beats codebase so that we don't have to keep modifying the index templates with each new Beats update in our environment.
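For anyone applying the same workaround in the meantime, the mapping change amounts to something like the fragment built below. This is a sketch only: the helper name `keyword_override` is hypothetical, the field list follows the data_stream.* fields in ECS, and how the fragment gets merged into a particular Filebeat index template depends on the setup and is not shown here.

```python
import json


def keyword_override(fields=("dataset", "namespace", "type")):
    """Build a mapping fragment that maps data_stream.* fields as keyword."""
    return {
        "mappings": {
            "properties": {
                "data_stream": {
                    "properties": {
                        # keyword instead of constant_keyword, as suggested above
                        name: {"type": "keyword"}
                        for name in fields
                    }
                }
            }
        }
    }


# Print the fragment as JSON so it can be pasted into a template body.
print(json.dumps(keyword_override(), indent=2))
```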