gloo icon indicating copy to clipboard operation
gloo copied to clipboard

proxy_latency_response_out is always equal to DURATION time of the request

Open edubonifs opened this issue 1 year ago • 5 comments

Gloo Edge Product

Enterprise

Gloo Edge Version

1.15.x

Kubernetes Version

1.26

Describe the bug

Testing with proxyLatency response_out, I observed that

proxy_latency_response_out: '%DYNAMIC_METADATA(io.solo.filters.http.proxy_latency:response_out)%'

Is always equal to:

DURATION: '%DURATION%'

Expected Behavior

For me, latency for response_out should only measure the latency on response time, not having the same value as duration of the total request.

Steps to reproduce the bug

Gateway object:

apiVersion: gateway.solo.io/v1
kind: Gateway
metadata:
  annotations:
  labels:
    app: gloo
  name: gateway-proxy
  namespace: gloo-system
spec:
  bindAddress: '::'
  bindPort: 8080
  httpGateway:
    options:
      proxyLatency:
        measureRequestInternally: true
        request: LAST_INCOMING_LAST_OUTGOING
        response: LAST_INCOMING_LAST_OUTGOING
  options:
    accessLoggingService:
      accessLog:
      - fileSink:
          jsonFormat:
            DURATION: '%DURATION%'
            proxy_latency_response_out: '%DYNAMIC_METADATA(io.solo.filters.http.proxy_latency:response_out)%'
          path: /dev/stdout
  proxyNames:
  - gateway-proxy
  ssl: false
  useProxyProto: false

Inject some fault injection on request:

apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: vs
  namespace: gloo-system
spec:
  virtualHost:
    domains:
    - '*'
    routes:
    - matchers:
      - prefix: /
      options:
        faults:
          delay:
            fixedDelay: 2s
            percentage: 100
      routeAction:
        single:
          upstream:
            name: httpbin-httpbin-8000
            namespace: gloo-system

As you see, I have a delay of 2s on the request, however, if I check my logs:

{"DURATION":2006,"proxy_latency_response_out":2006}

proxy_latency_response_out is always taking the total response time

Additional Environment Detail

No response

Additional Context

No response

┆Issue is synchronized with this Asana task by Unito

edubonifs avatar Apr 03 '24 14:04 edubonifs

Reproducer project: https://github.com/DuncanDoyle/ge-gloo-9317

DuncanDoyle avatar Apr 17 '24 11:04 DuncanDoyle

Proxy latency definitely needs more docs.

However this is what it does: it gets the total time in the proxy so that it can report what the overhead of using the proxy is.

In this case the reproduction steps involve setting a delay via the fault filter https://www.envoyproxy.io/docs/envoy/latest/configuration/http/http_filters/fault_filter which can do things like simulate a delay on a request.

Unfortunately in this case we have the following issue fault filter is technically in envoy and therefore counts towards proxy latency filters time.

EDIT: removing a section previously here about nighthawk. While it had mostly correct info it actually had a piece that was not quite right. Clarifying comment below.

nfuden avatar Apr 24 '24 13:04 nfuden

Reopened for two parts.

Issue as stated here: DURATION versus request_out how are they different. Today the dynamic metadata of request_out is recorded as the first byte sent to the downstream while DURATION is when the whole requestresponse is done. This means that the difference in the two is minimal.

Gap: Proxy latency filter today emits 2 types of artifacts. It emits 3 fields in DYNAMIC_METADATA which are mainly for internal use. These are as follows 1.response_out: the time when the first byte is sent downstream 2. request_out: the time when the first byte is sent upstream 3.request_out_internal: the time until the first byte of headers is processed Proxy latency filter also emits STATS as an artifact: These are proxy_latency.rq_proxy_latency and proxy_latency.resp_proxy_latency the exact value of these is tunable via configuration as found here https://docs.solo.io/gloo-edge/latest/reference/api/github.com/solo-io/gloo/projects/gloo/api/external/envoy/extensions/proxylatency/proxylatency.proto.sk/#measurement

Possible follow ups:

  1. Improve documentation on this in main docs
  2. Expose the stats as metadata as well

nfuden avatar May 05 '24 13:05 nfuden

After further investigation it seems like we should mainly push users to recreate these stats if needed in metadata. Most of the possible setups we have with stats can be derived via the current metadata from this along with the upstream operators REQUEST_TX_DURATION RESPONSE_TX_DURATION

nfuden avatar May 06 '24 21:05 nfuden

Put up PR for this.

Includes some additional documentation around how this plays with commun_duration https://github.com/envoyproxy/envoy/pull/33240

nfuden avatar May 15 '24 14:05 nfuden