envoy EnvoyMetricsService receiving empty histograms and summaries

Description:

Using Envoy 1.22.2 and hooking it up to a grpc metric sink for prometheus to scrape later. I'm noticing that while the counters and gauges are coming in correctly, the summaries and histograms are empty. The histograms have buckets, but no samples. If I check on the envoy prometheus metrics, the histograms have values and there are no summaries (not sure why I'm getting those on the grpc sink). Neither the summaries nor the histograms have any actual samples. In the configuration below I tried adding a stats_config to set the default buckets just in case, but it doesn't seem to be working. Any ideas?

% envoy --version

envoy  version: c919bdec19d79e97f4f56e4095706f8e6a383f1c/1.22.2/Modified/RELEASE/BoringSSL

Admin and Stats Output:

/stats/prometheus output for the envoy_cluster_upstream_cx_connect_ms histogram

envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="0.5"} 0
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="1"} 0
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="5"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="10"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="25"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="50"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="100"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="250"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="500"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="1000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="2500"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="5000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="10000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="30000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="60000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="300000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="600000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="1800000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="3600000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="+Inf"} 2
envoy_cluster_upstream_cx_connect_ms_sum{envoy_cluster_name="envoy_exporter"} 4.0999999999999996447286321199499

Config:

overload_manager:
  refresh_interval: 0.25s
  resource_monitors:
    - name: "envoy.resource_monitors.fixed_heap"
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
        max_heap_size_bytes: 2147483648 # 2 GiB
  actions:
    - name: "envoy.overload_actions.shrink_heap"
      triggers:
        - name: "envoy.resource_monitors.fixed_heap"
          threshold:
            value: 0.95
    - name: "envoy.overload_actions.stop_accepting_requests"
      triggers:
        - name: "envoy.resource_monitors.fixed_heap"
          threshold:
            value: 0.98

static_resources:
  clusters:
    - name: control_plane
      per_connection_buffer_limit_bytes: 32768 # 32 KiB
      connect_timeout:
        seconds: 5
      dns_lookup_family: V4_ONLY
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          common_http_protocol_options: { }
          upstream_http_protocol_options: { }
          explicit_http_config:
            http2_protocol_options:
              max_concurrent_streams: 100
              initial_stream_window_size: 65536 # 64 KiB
              initial_connection_window_size: 1048576 # 1 MiB
      upstream_connection_options:
        tcp_keepalive:
          keepalive_probes: 1
          keepalive_time: 10
          keepalive_interval: 10
      load_assignment:
        cluster_name: control_plane
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: 127.0.0.1
                      port_value: 7001

    - name: envoy_exporter
      per_connection_buffer_limit_bytes: 32768 # 32 KiB
      connect_timeout:
        seconds: 5
      dns_lookup_family: V4_ONLY
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          common_http_protocol_options: { }
          upstream_http_protocol_options: { }
          explicit_http_config:
            http2_protocol_options:
              max_concurrent_streams: 100
              initial_stream_window_size: 65536 # 64 KiB
              initial_connection_window_size: 1048576 # 1 MiB
      upstream_connection_options:
        tcp_keepalive:
          keepalive_probes: 1
          keepalive_time: 10
          keepalive_interval: 10
      load_assignment:
        cluster_name: envoy_exporter
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: 127.0.0.1
                      port_value: 11001

dynamic_resources:
  ads_config:
    api_type: GRPC
    transport_api_version: V3
    grpc_services:
      - envoy_grpc:
          cluster_name: control_plane
  cds_config:
    resource_api_version: V3
    ads: {}
  lds_config:
    resource_api_version: V3
    ads: {}

cluster_manager:
  outlier_detection:
    event_log_path: "/dev/stdout"

node:
  id: service-proxy
  cluster: control_plane

admin:
  access_log_path: /tmp/admin_access.log
  address:
    socket_address: { address: 0.0.0.0, port_value: 15000 }

layered_runtime:
  layers:
    - name: static_layer_0
      static_layer:
        envoy:
          resource_limits:
            listener:
              "listener_443:0.0.0.0:443":
                connection_limit: 10000
        overload:
          global_downstream_max_connections: 50000

#stats_config:
#  histogram_bucket_settings:
#    - match:
#        safe_regex:
#          google_re2: { }
#          regex: ".*?"
#      buckets:
#        - 0.5
#        - 1
#        - 5
#        - 10
#        - 25
#        - 50
#        - 100
#        - 250
#        - 500
#        - 1000
#        - 2500
#        - 5000
#        - 10000
#        - 30000
#        - 60000
#        - 300000
#        - 600000
#        - 1800000
#        - 3600000

stats_sinks:
  - name: envoy.stat_sinks.metrics_service
    typed_config:
      "@type": type.googleapis.com/envoy.config.metrics.v3.MetricsServiceConfig
      transport_api_version: V3
      emit_tags_as_labels: true
      grpc_service:
        envoy_grpc:
          cluster_name: envoy_exporter

Jun 21 '22 23:06 marcosrmendezthd

some screenshots of the empty histogram that is sent to the metrics service.

7:17PM WRN ignoring empty summary metric=envoy_server_initialization_time_ms original-name=server.initialization_time_ms
7:17PM WRN ignoring empty summary metric=envoy_cluster_envoy_exporter_upstream_cx_length_ms original-name=cluster.envoy_exporter.upstream_cx_length_ms
7:17PM WRN ignoring empty summary metric=envoy_cluster_envoy_exporter_upstream_cx_connect_ms original-name=cluster.envoy_exporter.upstream_cx_connect_ms
7:17PM WRN ignoring empty summary metric=envoy_cluster_control_plane_upstream_cx_length_ms original-name=cluster.control_plane.upstream_cx_length_ms
7:17PM WRN ignoring empty summary metric=envoy_cluster_manager_cds_update_duration original-name=cluster_manager.cds.update_duration
7:17PM WRN ignoring empty summary metric=envoy_listener_manager_lds_update_duration original-name=listener_manager.lds.update_duration
7:17PM WRN ignoring empty summary metric=envoy_cluster_control_plane_upstream_cx_connect_ms original-name=cluster.control_plane.upstream_cx_connect_ms
7:17PM WRN ignoring empty summary metric=envoy_server_initialization_time_ms original-name=server.initialization_time_ms
7:17PM WRN ignoring empty summary metric=envoy_cluster_envoy_exporter_upstream_cx_length_ms original-name=cluster.envoy_exporter.upstream_cx_length_ms
7:17PM WRN ignoring empty summary metric=envoy_cluster_envoy_exporter_upstream_cx_connect_ms original-name=cluster.envoy_exporter.upstream_cx_connect_ms
7:17PM WRN ignoring empty summary metric=envoy_cluster_control_plane_upstream_cx_length_ms original-name=cluster.control_plane.upstream_cx_length_ms
7:17PM WRN ignoring empty summary metric=envoy_cluster_manager_cds_update_duration original-name=cluster_manager.cds.update_duration

Jun 21 '22 23:06 marcosrmendezthd

a better example:

this is what envoy exposes on /stats/prometheus

# TYPE envoy_cluster_upstream_cx_connect_ms histogram
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="0.5"} 0
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="1"} 0
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="5"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="10"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="25"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="50"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="100"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="250"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="500"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="1000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="2500"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="5000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="10000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="30000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="60000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="300000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="600000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="1800000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="3600000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="+Inf"} 2
envoy_cluster_upstream_cx_connect_ms_sum{envoy_cluster_name="exporter"} 6.0999999999999996447286321199499
envoy_cluster_upstream_cx_connect_ms_count{envoy_cluster_name="exporter"} 2

this is what the sample program sees:


2022/06/22 11:22:48 empty histogram
2022/06/22 11:22:48 summary cluster.upstream_cx_connect_ms{"envoy.cluster_name"="exporter"} quantile:<quantile:0 value:nan > quantile:<quantile:0.25 value:nan > quantile:<quantile:0.5 value:nan > quantile:<quantile:0.75 value:nan > quantile:<quantile:0.9 value:nan > quantile:<quantile:0.95 value:nan > quantile:<quantile:0.99 value:nan > quantile:<quantile:0.995 value:nan > quantile:<quantile:0.999 value:nan > quantile:<quantile:1 value:nan > 
2022/06/22 11:22:48 empty histogram

Jun 22 '22 15:06 marcosrmendezthd

to reproduce this run envoy 1.22.2 with the following configuration.

$ envoy --version envoy version: c919bdec19d79e97f4f56e4095706f8e6a383f1c/1.22.2/Modified/RELEASE/BoringSSL

$ envoy -c envoy.yaml


static_resources:
  listeners:
    - address:
        socket_address:
          address: 0.0.0.0
          port_value: 8080
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                codec_type: AUTO
                stat_prefix: ingress_http
                route_config:
                  name: local_route
                  virtual_hosts:
                    - name: backend
                      domains:
                        - "*"
                      routes:
                        - match:
                            prefix: "/"
                          route:
                            cluster: service
                http_filters:
                  - name: envoy.filters.http.router
                    typed_config:
                      "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router

  clusters:
    - name: service
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      upstream_connection_options:
        tcp_keepalive:
          keepalive_probes: 1
          keepalive_time: 10
          keepalive_interval: 10
      load_assignment:
        cluster_name: service1
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: 127.0.0.1
                      port_value: 8000
    - name: exporter
      per_connection_buffer_limit_bytes: 32768 # 32 KiB
      connect_timeout:
        seconds: 5
      dns_lookup_family: V4_ONLY
      type: STRICT_DNS
      lb_policy: ROUND_ROBIN
      typed_extension_protocol_options:
        envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
          "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
          common_http_protocol_options: { }
          upstream_http_protocol_options: { }
          explicit_http_config:
            http2_protocol_options:
              max_concurrent_streams: 100
              initial_stream_window_size: 65536 # 64 KiB
              initial_connection_window_size: 1048576 # 1 MiB
      upstream_connection_options:
        tcp_keepalive:
          keepalive_probes: 1
          keepalive_time: 10
          keepalive_interval: 10
      load_assignment:
        cluster_name: exporter
        endpoints:
          - lb_endpoints:
              - endpoint:
                  address:
                    socket_address:
                      address: 127.0.0.1
                      port_value: 9000
admin:
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9001
layered_runtime:
  layers:
    - name: static_layer_0
      static_layer:
        envoy:
          resource_limits:
            listener:
              example_listener_name:
                connection_limit: 10000

#stats_config:
#  histogram_bucket_settings:
#    - match:
#        safe_regex:
#          google_re2: { }
#          regex: ".*?"
#      buckets:
#        - 0.5
#        - 1
#        - 5
#        - 10
#        - 25
#        - 50
#        - 100
#        - 250
#        - 500
#        - 1000
#        - 2500
#        - 5000
#        - 10000
#        - 30000
#        - 60000
#        - 300000
#        - 600000
#        - 1800000
#        - 3600000

stats_sinks:
  - name: envoy.stat_sinks.metrics_service
    typed_config:
      "@type": type.googleapis.com/envoy.config.metrics.v3.MetricsServiceConfig
      transport_api_version: V3
      emit_tags_as_labels: true
      grpc_service:
        envoy_grpc:
          cluster_name: exporter

Jun 22 '22 15:06 marcosrmendezthd

then start a simple python server that will serve as the upstream

python3 -m http.server

and run this go program to serve as the metrics sink: go run main.go

package main

import (
	"flag"
	"fmt"
	io_prometheus_client "github.com/prometheus/client_model/go"
	"io"
	"log"
	"net"
	"strings"

	v3 "github.com/envoyproxy/go-control-plane/envoy/service/metrics/v3"
	"google.golang.org/grpc"
)

type sink struct{}

func (s *sink) StreamMetrics(stream v3.MetricsService_StreamMetricsServer) error {
	log.Println("started stream")
	for {
		msg, recvErr := stream.Recv()
		if recvErr == io.EOF {
			log.Println("finished stream")
			return nil
		}
		if recvErr != nil {
			log.Printf("ERROR: %v\n", recvErr)
			return recvErr
		}
		for _, mf := range msg.EnvoyMetrics {
			if mf == nil {
				continue
			}
			metricType := mf.GetType()
			if metricType != io_prometheus_client.MetricType_SUMMARY &&
				metricType != io_prometheus_client.MetricType_HISTOGRAM {
				continue
			}

			for _, metric := range mf.Metric {
				if metric == nil {
					continue
				}
				if metricType == io_prometheus_client.MetricType_HISTOGRAM {
					if metric.Histogram == nil || len(metric.Histogram.Bucket) == 0 || metric.Histogram.SampleSum == nil || metric.Histogram.SampleCount == nil ||
						metric.Histogram.GetSampleCount() == 0 || metric.Histogram.GetSampleSum() == 0 {
						if metric.Summary == nil {
							log.Println("empty histogram")
						} else {

							log.Println("histogram", getMetricName(mf.GetName(), metric), metric.Histogram.String())
						}
					}
				}
				if metricType == io_prometheus_client.MetricType_SUMMARY {
					if metric.Summary == nil || metric.Summary.SampleSum == nil || metric.Summary.SampleCount == nil ||
						metric.Summary.GetSampleCount() == 0 || metric.Summary.GetSampleSum() == 0 {
						if metric.Summary == nil {
							log.Println("empty summary")
						} else {
							log.Println("summary", getMetricName(mf.GetName(), metric), metric.Summary.String())
						}
					}
				}
			}
		}
	}
}

func getMetricName(name string, metric *io_prometheus_client.Metric) string {
	if metric == nil {
		return ""
	}
	if len(name) == 0 {
		return ""
	}
	result := name + "{"
	for _, label := range metric.Label {
		if label == nil {
			continue
		}
		name := strings.TrimSpace(label.GetName())
		if len(name) == 0 {
			continue
		}
		result += fmt.Sprintf("%q=%q", name, strings.TrimSpace(label.GetValue())) + ","
	}
	result = strings.TrimSuffix(result, ",")
	result += "}"
	return result
}

var (
	addr = flag.String("address", "0.0.0.0:9000", "grpc address to listen on")
)

func main() {

	flag.Parse()

	grpcServer := grpc.NewServer()
	v3.RegisterMetricsServiceServer(grpcServer, &sink{})

	l, listenErr := net.Listen("tcp", *addr)
	if listenErr != nil {
		log.Fatalf("ERROR: %v", listenErr)
	}

	log.Printf("listening on %s", *addr)
	if serveErr := grpcServer.Serve(l); serveErr != nil {
		if serveErr != grpc.ErrServerStopped {
			log.Fatalf("ERROR: %v", serveErr)
		}
	}
	log.Printf("finished")
}

Jun 22 '22 15:06 marcosrmendezthd

cc @ramaraochavali @jmarantz

Jun 23 '22 21:06 ggreenway

any updates on this?

Jun 29 '22 22:06 marcosrmendezthd

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

Jul 30 '22 00:07 github-actions[bot]

any updates @jmarantz @ramaraochavali @ggreenway ?

Jul 31 '22 14:07 marcosrmendezthd

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

Sep 08 '22 16:09 github-actions[bot]

any updates?

Sep 08 '22 17:09 marcosrmendezthd

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

Oct 09 '22 00:10 github-actions[bot]

any updates?

Oct 12 '22 12:10 marcosrmendezthd

TBH I had not even heard of the envoy metrics service grpc endpoint. Is it possible that @lizan knows who owns this?

Oct 12 '22 12:10 jmarantz

@jmarantz We built it long time ago and I know the code. @marcosrmendezthd Sorry. I missed this. I will take look tomorrow.

Oct 12 '22 13:10 ramaraochavali

@jmarantz We built it long time ago and I know the code. @marcosrmendezthd Sorry. I missed this. I will take look tomorrow.

thank you!!!!!!!

Oct 12 '22 14:10 marcosrmendezthd

Ah.. Are you looking for prometheus metrics? grpc metrics service does not send metrics using prometheus format. It uses these quantiles and using these buckets

If you want prometheus metrics, you have to scrape from this admin endpoint

Sorry we do not use this in prometheus format.

Oct 13 '22 09:10 ramaraochavali

so based on https://github.com/envoyproxy/envoy/blob/d142c9d55ae9aab34e9924aa25f20bd27635e060/test/extensions/stats_sinks/metrics_service/grpc_metrics_service_impl_test.cc the sink is supposed to be sending histograms and summaries. why are they empty? the metrics are emitted based on the prometheus proto => https://github.com/envoyproxy/envoy/blob/9d5627a0879b0a029e90515137c108e1d2884bfc/api/envoy/service/metrics/v3/metrics_service.proto#L7, as I'm able to read gauges and counters. @ramaraochavali

Oct 14 '22 18:10 marcosrmendezthd

@marcosrmendezthd Yes. It is supposed to send histograms. But as I mentioned earlier, it was not implemented for sending metrics in prometheus format (it uses prometheus proto even for sending metrics in non prometheus format because it is standard proto for transporting metrics via gRPC). Histograms needed special logic to print in the `/prometheus' format . @suhailpatel added that support in /prometheus endpoint . My guess is that logic is missing here. As I mentioned earlier, Prometheus metrics are always expected to me scraped via /metrics endpoint as that is the standard mechanism for prometheus.

If you want metrics service to handle that, we may need some fixes there similar to what @suhailpatel did. @suhailpatel to keep me honest (because we have never used metrics service for prometheus metrics).

Oct 15 '22 03:10 ramaraochavali

right. this is what this issue is about. @ramaraochavali 😄

Oct 17 '22 23:10 marcosrmendezthd

@marcosrmendezthd as you are motivated to see this fixed, would you like to take a run at it yourself? :)

Oct 17 '22 23:10 jmarantz

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

Nov 17 '22 00:11 github-actions[bot]

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.

Nov 24 '22 04:11 github-actions[bot]

envoy envoy copied to clipboard

EnvoyMetricsService receiving empty histograms and summaries

envoy
envoy copied to clipboard