
Use metadata for data_stream_auto_routing

Open smalenfant opened this issue 4 years ago • 5 comments

While configuring Logstash for data streams, I noticed that the following gets indexed with every event:

    "data_stream": {
      "dataset": "ds",
      "type": "metrics",
      "namespace": "cdn"
    },

These fields are now duplicated and take up a lot of space. They should be read from @metadata instead.

I tried to work around this, but the data stream settings don't accept variables:

    output {
      elasticsearch {
        # This setting must be one of ["logs", "metrics", "synthetics"]
        # Expected one of ["logs", "metrics", "synthetics"], got ["%{[@data_stream][type]}"]
        data_stream_type => "%{[@data_stream][type]}"
        ...
      }
    }

I also tried the following workaround:

filter {
  mutate {
    add_field => {
      "[@data_stream][type]" => "metrics"
      "[@data_stream][dataset]" => "ds"
      "[@data_stream][namespace]" => "%{[tags][cdn]}"
    }
  }
}

output {
  stdout { }
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    http_compression => true
    sniffing => false
    data_stream => true
    data_stream_auto_routing => false
    data_stream_dataset => "%{[@data_stream][dataset]}"
    data_stream_namespace => "%{[@data_stream][namespace]}"
  }
}

But that didn't work: the variable expansion never happened, and the index name came out as logs-%{[@data_stream][dataset]}-%{[@data_stream][namespace]}.

Maybe I'm doing something wrong. Please let me know if there is a workaround.

smalenfant avatar Dec 16 '21 14:12 smalenfant

These fields are now duplicated and take up a lot of space. They should be read from @metadata instead.

data_stream.type etc. fields are of the constant_keyword type (unless the mapping has been changed) - these take up no extra space per document - the value is stored once in the mapping.
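
For reference, the built-in data stream index templates map these fields roughly like this (a sketch; the exact built-in template may differ):

    {
      "mappings": {
        "properties": {
          "data_stream": {
            "properties": {
              "type":      { "type": "constant_keyword" },
              "dataset":   { "type": "constant_keyword" },
              "namespace": { "type": "constant_keyword" }
            }
          }
        }
      }
    }

With constant_keyword, the field value is recorded once in the mapping rather than stored per document, which is why duplicating it on every event costs effectively nothing.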

The data_stream_type option and its relatives in the ES output are meant to be static (for now), for the case of writing to one known data stream. The dynamic (multi-DS) case is supposed to be handled by relying on the data_stream fields the event itself contains; effectively, the options are a fallback for when the event does not carry DS routing information.
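
In other words, the supported dynamic approach is to set the top-level data_stream fields on the event (not under @metadata or @data_stream) and let auto routing pick them up. A sketch, with example field values:

    filter {
      mutate {
        add_field => {
          "[data_stream][type]" => "metrics"
          "[data_stream][dataset]" => "ds"
          "[data_stream][namespace]" => "cdn"
        }
      }
    }

    output {
      elasticsearch {
        hosts => ["http://elasticsearch:9200"]
        data_stream => true
        # default; routes each event on its own data_stream.* fields
        data_stream_auto_routing => true
      }
    }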

Not sure there are other compelling reasons to support sprintf on these, apart from the mistaken idea of saving space by dropping the data_stream.type, data_stream.namespace, and data_stream.dataset keyword fields ...

kares avatar Jan 26 '22 07:01 kares

That was the key info here... constant_keyword. But when using your own mapping, these need to be configured explicitly. I would still love to not see them in the index output if possible.

Is there any way to remove certain fields by default from Kibana queries?

smalenfant avatar Jan 26 '22 15:01 smalenfant

+1, this could be super handy for dynamically creating data streams. I would like to dispatch my events into different data streams by specifying a variable value for the data_stream_dataset field. Example:

    elasticsearch {
        hosts => "https://10.1.1.45:9200"
        user => "YYY"
        password => "XXX"
        data_stream => "true"
        data_stream_type => "metrics"
        data_stream_dataset => "%{dataset}"
        data_stream_namespace => "snmpwalk"
    }

michaelhyatt avatar Oct 13 '22 09:10 michaelhyatt

+1 could be a very nice upgrade https://discuss.elastic.co/t/dynamic-naming-of-elasticsearch-data-streams/325278/3

sbocquet avatar Feb 10 '23 16:02 sbocquet

+1, as it's a very interesting way to avoid a lot of 'if' conditions in a Logstash or ingest pipeline.
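
For context, without variable support the routing has to be spelled out per dataset with conditionals, roughly like this (a sketch; the dataset names and host are examples):

    output {
      if [dataset] == "cpu" {
        elasticsearch {
          hosts => ["https://10.1.1.45:9200"]
          data_stream => true
          data_stream_type => "metrics"
          data_stream_dataset => "cpu"
          data_stream_namespace => "snmpwalk"
        }
      } else if [dataset] == "memory" {
        elasticsearch {
          hosts => ["https://10.1.1.45:9200"]
          data_stream => true
          data_stream_type => "metrics"
          data_stream_dataset => "memory"
          data_stream_namespace => "snmpwalk"
        }
      }
    }

Every new dataset means another branch, which is exactly what sprintf support on data_stream_dataset would eliminate.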

By the way, using a variable to define the dataset is already possible with the Reroute processor in ingest pipelines.
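
A minimal sketch of that approach (the reroute processor is available in Elasticsearch 8.8+; the pipeline name and the service.name field are examples, adjust to your data):

    PUT _ingest/pipeline/route-by-dataset
    {
      "processors": [
        {
          "reroute": {
            "dataset": "{{service.name}}",
            "namespace": "default"
          }
        }
      ]
    }

Attach the pipeline to the initial data stream (e.g. via index.default_pipeline in its template), and documents are rerouted to the data stream derived from the event's own field at ingest time.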

MatheusGelinskiPires avatar May 16 '24 22:05 MatheusGelinskiPires