Add metrics specific to bifrost-gateway setup
This is a meta-issue about useful metrics in bifrost-gateway. We may ship only a subset of the below for the Rhea project.
## Overview
The go-libipfs/gateway library will provide some visibility into incoming requests **(1)**,
but we need to add metrics to track the performance of the Saturn client (block provider) **(2)**
and other internals, like resolution costs for different content path types and any in-memory caches we may add **(3)**.
```mermaid
graph LR
    A(((fa:fa-person HTTP<br>clients)))
    B[bifrost-gateway]
    N[[fa:fa-hive bifrost-infra:<br>HTTP load-balancers<br> nginx, TLS termination]]
    S(((saturn.pl<br>CDN)))
    M0[( 0 <br>NGINX/LB<br/>LOGS&METRICS)]
    M1[( 1 <br>HTTP<br/>METRICS:<br/> ipfs_http_*)]
    M2[( 2 <br>BLOCK<br/>PROVIDER<br/>METRICS <br/>???)]
    M3[( 3 <br>INTERNAL<br/>METRICS<br/>???)]

    A -->| Accept: .. <br>?format=<br>Host:| N
    N --> M1 --> B
    N .-> M0
    B --> M2 ---> S
    B .-> M3
```
**(0)** covers metrics tracked before bifrost-gateway and is out of scope.
## Proposed metrics [WIP]
Below is a snapshot / brain dump. It is not ready yet; we want to have an internal analysis/discussion before we start.
### For (1)
- Per request type
  - Duration Histogram per request type
    - We want a global variant, and one per namespace (`/ipfs/` or `/ipns/`)
    - See the Appendix below for an example of what a histogram looks like
    - Why?
      - We need to measure each request type, informed by `?format=` and the `Accept` header, because:
        - They have different complexity involved, and will have different latency costs
        - We want to be able to see which ones are most popular, and comparing `_sum` from histograms will allow us to see % distribution
      - We need to measure `/ipfs/` and `/ipns/` separately to see the impact the additional resolution step (IPNS or DNSLink) has.
  - Response Size Histogram per request type
    - We want a global variant, and one per namespace (`/ipfs/` or `/ipns/`)
    - Why?
      - Understanding the average response size allows us to correctly interpret the Duration. Without it, the Duration of a UnixFS response does not tell us whether the file was big or our stack was slow.
- Count GET vs HEAD requests
  - For each, count requests with `Cache-Control: only-if-cached`
    - Open question (can be answered later, after we see initial counts): should we exclude these requests from the totals? My initial suggestion is to exclude them. If they become popular, they will skew the numbers, as a request for a 4GB file will be "insanely fast".
- Count 200 vs 2xx vs 3xx vs 400 vs 500 response codes
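To make this concrete, here is a minimal sketch of how the metrics above could be declared with the Prometheus Go client. Metric names, label names, and bucket boundaries are my own illustrative assumptions, not the final go-libipfs/gateway API:

```go
// Hypothetical declarations for the (1) metrics; names and labels are assumptions.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Duration per request type (?format= / Accept) and namespace (/ipfs vs /ipns).
	requestDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "ipfs_http_gw_request_duration_seconds",
		Help:    "Time to serve a gateway request, per request type and namespace.",
		Buckets: []float64{0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60},
	}, []string{"namespace", "request_type"})

	// Response size per request type and namespace, to put durations in context.
	responseSize = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "ipfs_http_gw_response_size_bytes",
		Help:    "Size of the response body sent to the client.",
		Buckets: prometheus.ExponentialBuckets(256, 4, 10), // 256 B .. ~64 MiB
	}, []string{"namespace", "request_type"})

	// GET vs HEAD, plus whether Cache-Control: only-if-cached was present.
	requestsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "ipfs_http_gw_requests_total",
		Help: "Incoming requests by HTTP method and only-if-cached flag.",
	}, []string{"method", "only_if_cached"})

	// Response codes, counted per class (200, 2xx, 3xx, 400, 500, ...).
	responsesTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "ipfs_http_gw_responses_total",
		Help: "Responses by status code class.",
	}, []string{"code_class"})
)

func init() {
	prometheus.MustRegister(requestDuration, responseSize, requestsTotal, responsesTotal)
}
```

The global variant does not need a separate series: summing over the `namespace` and `request_type` labels in PromQL gives the global view, while keeping per-namespace breakdowns available.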
### For (2)
- Initially, we will only request raw blocks (`application/vnd.ipld.raw`) from Saturn:
  - Duration Histogram for block requests
  - Response Size Histogram for block requests
  - Count 200 vs non-200 response codes
- TBD: future (fancy `application/vnd.ipld.car`):
  - All requests will be for resolved `/ipfs/` paths
  - We will most likely want to track:
    - Duration and response size per original request type (histograms)
    - If we support sub-paths, then we will also need to track Requested Content Path length (histogram)
- TBD: if we put some sort of block cache in front of it, track HIT/MISS, probably per request type
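As a rough illustration of (2), the block provider's HTTP calls to Saturn could be wrapped like this. The metric names, the `fetchRawBlock` helper, and the exact request shape are assumptions for the sketch; only the `application/vnd.ipld.raw` content type comes from the plan above:

```go
// Hypothetical instrumentation of the Saturn block provider; not real bifrost-gateway code.
package blockprovider

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	blockGetDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "bifrost_saturn_block_get_duration_seconds",
		Help:    "Time to fetch a raw block (application/vnd.ipld.raw) from Saturn.",
		Buckets: []float64{0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60},
	})
	blockGetResponses = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "bifrost_saturn_block_get_responses_total",
		Help: "Block requests to Saturn, split into 200 vs non-200 (and transport errors).",
	}, []string{"outcome"})
	blockCache = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "bifrost_block_cache_total",
		Help: "Local block cache hits and misses, if we add such a cache.",
	}, []string{"result"}) // "hit" or "miss"
)

func init() {
	prometheus.MustRegister(blockGetDuration, blockGetResponses, blockCache)
}

// fetchRawBlock fetches one raw block from Saturn and records duration and outcome.
func fetchRawBlock(client *http.Client, saturnURL, cid string) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodGet, saturnURL+"/ipfs/"+cid, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/vnd.ipld.raw")

	start := time.Now()
	resp, err := client.Do(req)
	blockGetDuration.Observe(time.Since(start).Seconds())
	switch {
	case err != nil:
		blockGetResponses.WithLabelValues("error").Inc()
		return nil, err
	case resp.StatusCode == http.StatusOK:
		blockGetResponses.WithLabelValues("200").Inc()
	default:
		blockGetResponses.WithLabelValues("non-200").Inc()
	}
	return resp, nil
}
```

A Response Size histogram would follow the same pattern; `blockCache` shows where HIT/MISS counts would hang if we add a cache later.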
### For (3)
A place for additional internal metrics that give us more visibility into details, if we ever need to zoom in.
- Duration Histogram for `/ipfs` resolution
  - Why? Allows us to eyeball when resolution becomes the source of general slowness / regression in TTFB
- Requested Content Path length Histogram for `/ipfs`
  - Why? We want to know the % of direct requests for a CID vs requests for deeper content paths
- Duration Histograms for `/ipns` resolutions (DNSLink, IPNS Record), both a single lookup and recursive resolution until `/ipfs/` is hit
  - Why?
    - `bifrost-gateway` will be delegating resolution to a remote HTTP endpoint
    - Both can be recursive, so the metrics will be skewed unless we measure both a single lookup and the full recursive resolution
    - We want to be able to see which ones are most popular, and how often recursive values are present. Comparing `_sum` from histograms will allow us to see % distribution.
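A small sketch of what the (3) resolution metrics could look like. The `kind`/`mode` label split (DNSLink vs IPNS Record, single lookup vs full recursive resolve) mirrors the list above; everything else (metric names, the `observeResolve` helper) is assumed for illustration:

```go
// Hypothetical internal resolution metrics for (3); names and labels are assumptions.
package resolution

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// One histogram family, labelled by what is being resolved and how.
	resolveDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "bifrost_path_resolve_duration_seconds",
		Help:    "Time to resolve a content path down to an immutable /ipfs/ path.",
		Buckets: []float64{0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60},
	}, []string{"kind", "mode"}) // kind: "ipfs", "dnslink", "ipns_record"; mode: "single", "recursive"

	// Number of path segments after the root CID in /ipfs/ requests.
	contentPathLength = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "bifrost_ipfs_content_path_length",
		Help:    "Requested content path length (path segments after the root CID).",
		Buckets: prometheus.LinearBuckets(0, 1, 11), // 0..10 segments
	})
)

func init() {
	prometheus.MustRegister(resolveDuration, contentPathLength)
}

// observeResolve times a single resolution step or a full recursive resolve.
func observeResolve(kind, mode string, resolve func() error) error {
	start := time.Now()
	err := resolve()
	resolveDuration.WithLabelValues(kind, mode).Observe(time.Since(start).Seconds())
	return err
}
```

Recording `mode="single"` for every individual lookup and `mode="recursive"` once per request keeps the two comparable, which addresses the skew mentioned in the list above.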
## Appendix: what histograms from go-libipfs/gateway look like
When I say "histogram", I mean the `_sum` and `_bucket` series we use in Kubo's `/debug/metrics/prometheus`:
```
# HELP ipfs_http_gw_raw_block_get_duration_seconds The time to GET an entire raw Block from the gateway.
# TYPE ipfs_http_gw_raw_block_get_duration_seconds histogram
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="0.05"} 927
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="0.1"} 984
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="0.25"} 1062
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="0.5"} 1067
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="1"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="2"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="5"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="10"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="30"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="60"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="+Inf"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_sum{gateway="ipfs"} 19.696413685999993
ipfs_http_gw_raw_block_get_duration_seconds_count{gateway="ipfs"} 1068
```
We can change the bucket distribution if that gives us better data, but it should be done on both ends.
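For completeness, a bucket layout like the one above is just a `[]float64` passed to `HistogramOpts.Buckets` in the Prometheus Go client, so keeping one shared slice on both ends is enough to keep the data comparable. The snippet below is a sketch of that idea, not how go-libipfs/gateway actually declares this metric:

```go
// Hypothetical shared bucket definition; the variable names are assumptions.
package metricsdoc

import "github.com/prometheus/client_golang/prometheus"

// Using the same slice in go-libipfs/gateway and bifrost-gateway is what
// "done on both ends" would mean in practice.
var defaultDurationBuckets = []float64{0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60}

// Namespace + Subsystem + Name combine into the
// ipfs_http_gw_raw_block_get_duration_seconds series shown in the appendix.
var rawBlockGetDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Namespace: "ipfs",
	Subsystem: "http",
	Name:      "gw_raw_block_get_duration_seconds",
	Help:      "The time to GET an entire raw Block from the gateway.",
	Buckets:   defaultDurationBuckets,
}, []string{"gateway"})
```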