EnvoyMetricsService receiving empty histograms and summaries
Description:
I'm using Envoy 1.22.2 and hooking it up to a gRPC metrics sink for Prometheus to scrape later. I'm noticing that while the counters and gauges come in correctly, the summaries and histograms arrive empty: the histograms have buckets, but no samples, and neither the summaries nor the histograms carry any sample counts or sums. If I check Envoy's own Prometheus endpoint, the histograms do have values and there are no summaries at all (so I'm not sure why summaries show up on the gRPC sink). In the configuration below I tried adding a stats_config to set the default buckets just in case, but it doesn't seem to help. Any ideas?
% envoy --version
envoy version: c919bdec19d79e97f4f56e4095706f8e6a383f1c/1.22.2/Modified/RELEASE/BoringSSL
Admin and Stats Output:
/stats/prometheus output for the envoy_cluster_upstream_cx_connect_ms histogram:
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="0.5"} 0
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="1"} 0
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="5"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="10"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="25"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="50"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="100"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="250"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="500"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="1000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="2500"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="5000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="10000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="30000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="60000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="300000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="600000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="1800000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="3600000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="envoy_exporter",le="+Inf"} 2
envoy_cluster_upstream_cx_connect_ms_sum{envoy_cluster_name="envoy_exporter"} 4.0999999999999996447286321199499
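(Reading the cumulative buckets above: both recorded samples fall in the (1, 5] ms bucket, and the _sum of roughly 4.1 ms implies a mean connect time of about 2.05 ms, so the admin endpoint clearly does have samples for this histogram.)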
Config:
overload_manager:
refresh_interval: 0.25s
resource_monitors:
- name: "envoy.resource_monitors.fixed_heap"
typed_config:
"@type": type.googleapis.com/envoy.extensions.resource_monitors.fixed_heap.v3.FixedHeapConfig
max_heap_size_bytes: 2147483648 # 2 GiB
actions:
- name: "envoy.overload_actions.shrink_heap"
triggers:
- name: "envoy.resource_monitors.fixed_heap"
threshold:
value: 0.95
- name: "envoy.overload_actions.stop_accepting_requests"
triggers:
- name: "envoy.resource_monitors.fixed_heap"
threshold:
value: 0.98
static_resources:
clusters:
- name: control_plane
per_connection_buffer_limit_bytes: 32768 # 32 KiB
connect_timeout:
seconds: 5
dns_lookup_family: V4_ONLY
type: STRICT_DNS
lb_policy: ROUND_ROBIN
typed_extension_protocol_options:
envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
"@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
common_http_protocol_options: { }
upstream_http_protocol_options: { }
explicit_http_config:
http2_protocol_options:
max_concurrent_streams: 100
initial_stream_window_size: 65536 # 64 KiB
initial_connection_window_size: 1048576 # 1 MiB
upstream_connection_options:
tcp_keepalive:
keepalive_probes: 1
keepalive_time: 10
keepalive_interval: 10
load_assignment:
cluster_name: control_plane
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: 127.0.0.1
port_value: 7001
- name: envoy_exporter
per_connection_buffer_limit_bytes: 32768 # 32 KiB
connect_timeout:
seconds: 5
dns_lookup_family: V4_ONLY
type: STRICT_DNS
lb_policy: ROUND_ROBIN
typed_extension_protocol_options:
envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
"@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
common_http_protocol_options: { }
upstream_http_protocol_options: { }
explicit_http_config:
http2_protocol_options:
max_concurrent_streams: 100
initial_stream_window_size: 65536 # 64 KiB
initial_connection_window_size: 1048576 # 1 MiB
upstream_connection_options:
tcp_keepalive:
keepalive_probes: 1
keepalive_time: 10
keepalive_interval: 10
load_assignment:
cluster_name: envoy_exporter
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: 127.0.0.1
port_value: 11001
dynamic_resources:
ads_config:
api_type: GRPC
transport_api_version: V3
grpc_services:
- envoy_grpc:
cluster_name: control_plane
cds_config:
resource_api_version: V3
ads: {}
lds_config:
resource_api_version: V3
ads: {}
cluster_manager:
outlier_detection:
event_log_path: "/dev/stdout"
node:
id: service-proxy
cluster: control_plane
admin:
access_log_path: /tmp/admin_access.log
address:
socket_address: { address: 0.0.0.0, port_value: 15000 }
layered_runtime:
layers:
- name: static_layer_0
static_layer:
envoy:
resource_limits:
listener:
"listener_443:0.0.0.0:443":
connection_limit: 10000
overload:
global_downstream_max_connections: 50000
#stats_config:
# histogram_bucket_settings:
# - match:
# safe_regex:
# google_re2: { }
# regex: ".*?"
# buckets:
# - 0.5
# - 1
# - 5
# - 10
# - 25
# - 50
# - 100
# - 250
# - 500
# - 1000
# - 2500
# - 5000
# - 10000
# - 30000
# - 60000
# - 300000
# - 600000
# - 1800000
# - 3600000
stats_sinks:
- name: envoy.stat_sinks.metrics_service
typed_config:
"@type": type.googleapis.com/envoy.config.metrics.v3.MetricsServiceConfig
transport_api_version: V3
emit_tags_as_labels: true
grpc_service:
envoy_grpc:
cluster_name: envoy_exporter
Some screenshots of the empty histograms sent to the metrics service (images omitted); the sink's log output:
7:17PM WRN ignoring empty summary metric=envoy_server_initialization_time_ms original-name=server.initialization_time_ms
7:17PM WRN ignoring empty summary metric=envoy_cluster_envoy_exporter_upstream_cx_length_ms original-name=cluster.envoy_exporter.upstream_cx_length_ms
7:17PM WRN ignoring empty summary metric=envoy_cluster_envoy_exporter_upstream_cx_connect_ms original-name=cluster.envoy_exporter.upstream_cx_connect_ms
7:17PM WRN ignoring empty summary metric=envoy_cluster_control_plane_upstream_cx_length_ms original-name=cluster.control_plane.upstream_cx_length_ms
7:17PM WRN ignoring empty summary metric=envoy_cluster_manager_cds_update_duration original-name=cluster_manager.cds.update_duration
7:17PM WRN ignoring empty summary metric=envoy_listener_manager_lds_update_duration original-name=listener_manager.lds.update_duration
7:17PM WRN ignoring empty summary metric=envoy_cluster_control_plane_upstream_cx_connect_ms original-name=cluster.control_plane.upstream_cx_connect_ms
7:17PM WRN ignoring empty summary metric=envoy_server_initialization_time_ms original-name=server.initialization_time_ms
7:17PM WRN ignoring empty summary metric=envoy_cluster_envoy_exporter_upstream_cx_length_ms original-name=cluster.envoy_exporter.upstream_cx_length_ms
7:17PM WRN ignoring empty summary metric=envoy_cluster_envoy_exporter_upstream_cx_connect_ms original-name=cluster.envoy_exporter.upstream_cx_connect_ms
7:17PM WRN ignoring empty summary metric=envoy_cluster_control_plane_upstream_cx_length_ms original-name=cluster.control_plane.upstream_cx_length_ms
7:17PM WRN ignoring empty summary metric=envoy_cluster_manager_cds_update_duration original-name=cluster_manager.cds.update_duration
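(For context: the metrics service sink flushes on Envoy's stats flush interval, which defaults to 5s and is configurable via `stats_flush_interval` in the bootstrap; that is presumably why the same warnings repeat in batches.)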
A better example. This is what Envoy exposes on /stats/prometheus:
# TYPE envoy_cluster_upstream_cx_connect_ms histogram
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="0.5"} 0
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="1"} 0
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="5"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="10"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="25"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="50"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="100"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="250"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="500"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="1000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="2500"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="5000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="10000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="30000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="60000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="300000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="600000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="1800000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="3600000"} 2
envoy_cluster_upstream_cx_connect_ms_bucket{envoy_cluster_name="exporter",le="+Inf"} 2
envoy_cluster_upstream_cx_connect_ms_sum{envoy_cluster_name="exporter"} 6.0999999999999996447286321199499
envoy_cluster_upstream_cx_connect_ms_count{envoy_cluster_name="exporter"} 2
And this is what the sample program sees:
2022/06/22 11:22:48 empty histogram
2022/06/22 11:22:48 summary cluster.upstream_cx_connect_ms{"envoy.cluster_name"="exporter"} quantile:<quantile:0 value:nan > quantile:<quantile:0.25 value:nan > quantile:<quantile:0.5 value:nan > quantile:<quantile:0.75 value:nan > quantile:<quantile:0.9 value:nan > quantile:<quantile:0.95 value:nan > quantile:<quantile:0.99 value:nan > quantile:<quantile:0.995 value:nan > quantile:<quantile:0.999 value:nan > quantile:<quantile:1 value:nan >
2022/06/22 11:22:48 empty histogram
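(In the prometheus client-model proto, a summary with no observed samples typically reports NaN for every quantile value, which matches the output above: the quantile structure is populated, but there is no sample data behind it.)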
To reproduce this, run Envoy 1.22.2 with the following configuration.
$ envoy --version
envoy version: c919bdec19d79e97f4f56e4095706f8e6a383f1c/1.22.2/Modified/RELEASE/BoringSSL
$ envoy -c envoy.yaml
static_resources:
listeners:
- address:
socket_address:
address: 0.0.0.0
port_value: 8080
filter_chains:
- filters:
- name: envoy.filters.network.http_connection_manager
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
codec_type: AUTO
stat_prefix: ingress_http
route_config:
name: local_route
virtual_hosts:
- name: backend
domains:
- "*"
routes:
- match:
prefix: "/"
route:
cluster: service
http_filters:
- name: envoy.filters.http.router
typed_config:
"@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
clusters:
- name: service
type: STRICT_DNS
lb_policy: ROUND_ROBIN
upstream_connection_options:
tcp_keepalive:
keepalive_probes: 1
keepalive_time: 10
keepalive_interval: 10
load_assignment:
cluster_name: service
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: 127.0.0.1
port_value: 8000
- name: exporter
per_connection_buffer_limit_bytes: 32768 # 32 KiB
connect_timeout:
seconds: 5
dns_lookup_family: V4_ONLY
type: STRICT_DNS
lb_policy: ROUND_ROBIN
typed_extension_protocol_options:
envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
"@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
common_http_protocol_options: { }
upstream_http_protocol_options: { }
explicit_http_config:
http2_protocol_options:
max_concurrent_streams: 100
initial_stream_window_size: 65536 # 64 KiB
initial_connection_window_size: 1048576 # 1 MiB
upstream_connection_options:
tcp_keepalive:
keepalive_probes: 1
keepalive_time: 10
keepalive_interval: 10
load_assignment:
cluster_name: exporter
endpoints:
- lb_endpoints:
- endpoint:
address:
socket_address:
address: 127.0.0.1
port_value: 9000
admin:
address:
socket_address:
address: 0.0.0.0
port_value: 9001
layered_runtime:
layers:
- name: static_layer_0
static_layer:
envoy:
resource_limits:
listener:
example_listener_name:
connection_limit: 10000
#stats_config:
# histogram_bucket_settings:
# - match:
# safe_regex:
# google_re2: { }
# regex: ".*?"
# buckets:
# - 0.5
# - 1
# - 5
# - 10
# - 25
# - 50
# - 100
# - 250
# - 500
# - 1000
# - 2500
# - 5000
# - 10000
# - 30000
# - 60000
# - 300000
# - 600000
# - 1800000
# - 3600000
stats_sinks:
- name: envoy.stat_sinks.metrics_service
typed_config:
"@type": type.googleapis.com/envoy.config.metrics.v3.MetricsServiceConfig
transport_api_version: V3
emit_tags_as_labels: true
grpc_service:
envoy_grpc:
cluster_name: exporter
Then start a simple Python server to act as the upstream:
python3 -m http.server
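(The `service` cluster points at this server on 127.0.0.1:8000; sending a few requests through the listener on port 8080, e.g. `curl http://localhost:8080/`, should give the connect-time histograms some samples to record.)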
Then run this Go program to act as the metrics sink: go run main.go
package main
import (
"flag"
"fmt"
io_prometheus_client "github.com/prometheus/client_model/go"
"io"
"log"
"net"
"strings"
v3 "github.com/envoyproxy/go-control-plane/envoy/service/metrics/v3"
"google.golang.org/grpc"
)
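// sink implements v3.MetricsServiceServer. Envoy's metrics service stat sink
// streams batches of prometheus client-model MetricFamily messages to
// StreamMetrics on every stats flush.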
type sink struct{}
func (s *sink) StreamMetrics(stream v3.MetricsService_StreamMetricsServer) error {
log.Println("started stream")
for {
msg, recvErr := stream.Recv()
if recvErr == io.EOF {
log.Println("finished stream")
return nil
}
if recvErr != nil {
log.Printf("ERROR: %v\n", recvErr)
return recvErr
}
for _, mf := range msg.EnvoyMetrics {
if mf == nil {
continue
}
metricType := mf.GetType()
if metricType != io_prometheus_client.MetricType_SUMMARY &&
metricType != io_prometheus_client.MetricType_HISTOGRAM {
continue
}
for _, metric := range mf.Metric {
if metric == nil {
continue
}
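				// Flag histogram and summary entries that arrived without any samples.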
if metricType == io_prometheus_client.MetricType_HISTOGRAM {
if metric.Histogram == nil || len(metric.Histogram.Bucket) == 0 || metric.Histogram.SampleSum == nil || metric.Histogram.SampleCount == nil ||
metric.Histogram.GetSampleCount() == 0 || metric.Histogram.GetSampleSum() == 0 {
					if metric.Histogram == nil {
log.Println("empty histogram")
} else {
log.Println("histogram", getMetricName(mf.GetName(), metric), metric.Histogram.String())
}
}
}
if metricType == io_prometheus_client.MetricType_SUMMARY {
if metric.Summary == nil || metric.Summary.SampleSum == nil || metric.Summary.SampleCount == nil ||
metric.Summary.GetSampleCount() == 0 || metric.Summary.GetSampleSum() == 0 {
if metric.Summary == nil {
log.Println("empty summary")
} else {
log.Println("summary", getMetricName(mf.GetName(), metric), metric.Summary.String())
}
}
}
}
}
}
}
func getMetricName(name string, metric *io_prometheus_client.Metric) string {
if metric == nil {
return ""
}
if len(name) == 0 {
return ""
}
result := name + "{"
for _, label := range metric.Label {
if label == nil {
continue
}
name := strings.TrimSpace(label.GetName())
if len(name) == 0 {
continue
}
result += fmt.Sprintf("%q=%q", name, strings.TrimSpace(label.GetValue())) + ","
}
result = strings.TrimSuffix(result, ",")
result += "}"
return result
}
var (
addr = flag.String("address", "0.0.0.0:9000", "grpc address to listen on")
)
func main() {
flag.Parse()
grpcServer := grpc.NewServer()
v3.RegisterMetricsServiceServer(grpcServer, &sink{})
l, listenErr := net.Listen("tcp", *addr)
if listenErr != nil {
log.Fatalf("ERROR: %v", listenErr)
}
log.Printf("listening on %s", *addr)
if serveErr := grpcServer.Serve(l); serveErr != nil {
if serveErr != grpc.ErrServerStopped {
log.Fatalf("ERROR: %v", serveErr)
}
}
log.Printf("finished")
}
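For contrast, here is a minimal sketch of what a populated histogram looks like in the prometheus client-model proto that the metrics service streams. The values and buckets below are invented for illustration; bucket counts are cumulative, as in the /stats/prometheus output above.
package main

import (
	"fmt"

	io_prometheus_client "github.com/prometheus/client_model/go"
)

// Tiny pointer helpers, since the client-model proto uses pointer fields.
func u64(v uint64) *uint64   { return &v }
func f64(v float64) *float64 { return &v }

func main() {
	// A histogram with samples carries a non-nil SampleCount/SampleSum and
	// non-zero cumulative bucket counts; the empty ones the sink receives
	// have buckets but none of this sample data.
	h := &io_prometheus_client.Histogram{
		SampleCount: u64(2),
		SampleSum:   f64(6.1),
		Bucket: []*io_prometheus_client.Bucket{
			{UpperBound: f64(1), CumulativeCount: u64(0)},
			{UpperBound: f64(5), CumulativeCount: u64(2)}, // both samples fell in (1, 5]
			{UpperBound: f64(10), CumulativeCount: u64(2)},
		},
	}
	fmt.Println(h.String())
}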
cc @ramaraochavali @jmarantz
any updates on this?
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
any updates @jmarantz @ramaraochavali @ggreenway ?
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
any updates?
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
any updates?
TBH I had not even heard of the envoy metrics service grpc endpoint. Is it possible that @lizan knows who owns this?
@jmarantz We built it a long time ago and I know the code. @marcosrmendezthd Sorry, I missed this. I will take a look tomorrow.
thank you!!!!!!!
Ah, are you looking for Prometheus metrics? The gRPC metrics service does not send metrics in Prometheus format. It uses these quantiles and these buckets.
If you want Prometheus metrics, you have to scrape them from this admin endpoint.
Sorry, we do not use this in Prometheus format.
So based on https://github.com/envoyproxy/envoy/blob/d142c9d55ae9aab34e9924aa25f20bd27635e060/test/extensions/stats_sinks/metrics_service/grpc_metrics_service_impl_test.cc, the sink is supposed to be sending histograms and summaries, so why are they empty? The metrics are emitted based on the prometheus proto (https://github.com/envoyproxy/envoy/blob/9d5627a0879b0a029e90515137c108e1d2884bfc/api/envoy/service/metrics/v3/metrics_service.proto#L7), and I'm able to read the gauges and counters. @ramaraochavali
@marcosrmendezthd Yes, it is supposed to send histograms. But as I mentioned earlier, it was not implemented for sending metrics in Prometheus format (it uses the prometheus proto even for non-Prometheus metrics because that is the standard proto for transporting metrics via gRPC). Histograms needed special logic to be printed in the `/prometheus` format; @suhailpatel added that support to the /prometheus endpoint. My guess is that that logic is missing here. As I mentioned earlier, Prometheus metrics are always expected to be scraped via the /metrics endpoint, as that is the standard mechanism for Prometheus.
If you want the metrics service to handle that, we may need fixes there similar to what @suhailpatel did. @suhailpatel, keep me honest here (we have never used the metrics service for Prometheus metrics).
Right, this is what this issue is about. @ramaraochavali 😄
@marcosrmendezthd as you are motivated to see this fixed, would you like to take a run at it yourself? :)
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.