Add metrics specific to bifrost-gateway setup
This is a meta-issue about useful metrics in bifrost-gateway. We may ship only a subset of the below for the Rhea project.
## Overview
The go-libipfs/gateway library will provide some visibility into incoming requests **(1)**,
but we need to add metrics to track the performance of the Saturn client (block provider) **(2)**
and other internals, like resolution costs for different content path types and any in-memory caches we may add **(3)**.
```mermaid
graph LR
    A(((fa:fa-person HTTP<br>clients)))
    B[bifrost-gateway]
    N[[fa:fa-hive bifrost-infra:<br>HTTP load-balancers<br> nginx, TLS termination]]
    S(((saturn.pl<br>CDN)))
    M0[( 0 <br>NGINX/LB<br/>LOGS&METRICS)]
    M1[( 1 <br>HTTP<br/>METRICS:<br/> ipfs_http_*)]
    M2[( 2 <br>BLOCK<br/>PROVIDER<br/>METRICS <br/>???)]
    M3[( 3 <br>INTERNAL<br/>METRICS<br/>???)]

    A -->| Accept: .. <br>?format=<br>Host:| N
    N --> M1 --> B
    N .-> M0
    B --> M2 ---> S
    B .-> M3
```
**(0)** covers metrics tracked before bifrost-gateway and is out of scope.
## Proposed metrics [WIP]
Below is a snapshot / brain dump. It is not ready yet; we want to have an internal analysis/discussion before we start.
### For (1)
- Per request type
  - Duration Histogram per request type
    - We want a global variant, and one per namespace (`/ipfs/` or `/ipns/`)
    - See the Appendix below for an example of what a histogram looks like
    - Why?
      - We need to measure each request type, informed by `?format=` and the `Accept` header, because:
        - They have different complexity involved, and will have different latency costs
        - We want to be able to see which ones are most popular, and comparing `_sum` from histograms will allow us to see % distribution
      - We need to measure `/ipfs/` and `/ipns/` separately to see the impact the additional resolution step (IPNS or DNSLink) has.
  - Response Size Histogram per request type
    - We want a global variant, and one per namespace (`/ipfs/` or `/ipns/`)
    - Why?
      - Understanding the average response size allows us to correctly interpret the Duration. Without it, the Duration of a UnixFS response does not tell us whether the file was big or our stack was slow.
- Count GET vs HEAD requests
  - For each, count requests with `Cache-Control: only-if-cached`
    - Open question (can be answered later, after we see initial counts): should we exclude these requests from the totals? My initial suggestion is to exclude them. If they become popular, they will skew the numbers, as a request for a 4GB file will be "insanely fast".
- Count 200 vs 2xx vs 3xx vs 400 vs 500 response codes
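To make this concrete, here is a minimal sketch of how the metrics above could be declared with the Prometheus Go client. Metric names, label names, and bucket boundaries are my own illustrative assumptions, not the final go-libipfs/gateway API:

```go
// Hypothetical declarations for the (1) metrics; names and labels are assumptions.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Duration per request type (?format= / Accept) and namespace (/ipfs vs /ipns).
	requestDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "ipfs_http_gw_request_duration_seconds",
		Help:    "Time to serve a gateway request, per request type and namespace.",
		Buckets: []float64{0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60},
	}, []string{"namespace", "request_type"})

	// Response size per request type and namespace, to put durations in context.
	responseSize = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "ipfs_http_gw_response_size_bytes",
		Help:    "Size of the response body sent to the client.",
		Buckets: prometheus.ExponentialBuckets(256, 4, 10), // 256 B .. ~64 MiB
	}, []string{"namespace", "request_type"})

	// GET vs HEAD, plus whether Cache-Control: only-if-cached was present.
	requestsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "ipfs_http_gw_requests_total",
		Help: "Incoming requests by HTTP method and only-if-cached flag.",
	}, []string{"method", "only_if_cached"})

	// Response codes, counted per class (200, 2xx, 3xx, 400, 500, ...).
	responsesTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "ipfs_http_gw_responses_total",
		Help: "Responses by status code class.",
	}, []string{"code_class"})
)

func init() {
	prometheus.MustRegister(requestDuration, responseSize, requestsTotal, responsesTotal)
}
```

The global variant does not need a separate series: summing over the `namespace` and `request_type` labels in PromQL gives the global view, while keeping per-namespace breakdowns available.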
### For (2)
- Initially, we will only request raw blocks (`application/vnd.ipld.raw`) from Saturn:
  - Duration Histogram for block requests
  - Response Size Histogram for block requests
  - Count 200 vs non-200 response codes
- TBD: future (fancy `application/vnd.ipld.car`):
  - All requests will be for resolved `/ipfs/` paths
  - We will most likely want to track:
    - Duration and response size per original request type (histograms)
    - If we support sub-paths, then we will also need to track Requested Content Path length (histogram)
- TBD: if we put some sort of block cache in front of it, track HIT/MISS, probably per request type
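As a rough illustration of (2), the block provider's HTTP calls to Saturn could be wrapped like this. The metric names, the `fetchRawBlock` helper, and the exact request shape are assumptions for the sketch; only the `application/vnd.ipld.raw` content type comes from the plan above:

```go
// Hypothetical instrumentation of the Saturn block provider; not real bifrost-gateway code.
package blockprovider

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	blockGetDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "bifrost_saturn_block_get_duration_seconds",
		Help:    "Time to fetch a raw block (application/vnd.ipld.raw) from Saturn.",
		Buckets: []float64{0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60},
	})
	blockGetResponses = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "bifrost_saturn_block_get_responses_total",
		Help: "Block requests to Saturn, split into 200 vs non-200 (and transport errors).",
	}, []string{"outcome"})
	blockCache = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "bifrost_block_cache_total",
		Help: "Local block cache hits and misses, if we add such a cache.",
	}, []string{"result"}) // "hit" or "miss"
)

func init() {
	prometheus.MustRegister(blockGetDuration, blockGetResponses, blockCache)
}

// fetchRawBlock fetches one raw block from Saturn and records duration and outcome.
func fetchRawBlock(client *http.Client, saturnURL, cid string) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodGet, saturnURL+"/ipfs/"+cid, nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Accept", "application/vnd.ipld.raw")

	start := time.Now()
	resp, err := client.Do(req)
	blockGetDuration.Observe(time.Since(start).Seconds())
	switch {
	case err != nil:
		blockGetResponses.WithLabelValues("error").Inc()
		return nil, err
	case resp.StatusCode == http.StatusOK:
		blockGetResponses.WithLabelValues("200").Inc()
	default:
		blockGetResponses.WithLabelValues("non-200").Inc()
	}
	return resp, nil
}
```

A Response Size histogram would follow the same pattern; `blockCache` shows where HIT/MISS counts would hang if we add a cache later.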
### For (3)
A place for additional internal metrics that give us more visibility into details, if we ever need to zoom in.
- Duration Histogram for `/ipfs` resolution
  - Why? Allows us to eyeball when resolution becomes the source of general slowness / regression in TTFB
- Requested Content Path length Histogram for `/ipfs`
  - Why? We want to know the % of direct requests for a CID vs requests for deeper content paths
- Duration Histograms for `/ipns` resolutions (DNSLink, IPNS Record), both a single lookup and recursive resolution until `/ipfs/` is hit
  - Why?
    - `bifrost-gateway` will be delegating resolution to a remote HTTP endpoint
    - Both can be recursive, so the metrics will be skewed unless we measure both a single lookup and the full recursive resolution
    - We want to be able to see which ones are most popular, and how often recursive values are present. Comparing `_sum` from histograms will allow us to see % distribution.
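A small sketch of what the (3) resolution metrics could look like. The `kind`/`mode` label split (DNSLink vs IPNS Record, single lookup vs full recursive resolve) mirrors the list above; everything else (metric names, the `observeResolve` helper) is assumed for illustration:

```go
// Hypothetical internal resolution metrics for (3); names and labels are assumptions.
package resolution

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// One histogram family, labelled by what is being resolved and how.
	resolveDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "bifrost_path_resolve_duration_seconds",
		Help:    "Time to resolve a content path down to an immutable /ipfs/ path.",
		Buckets: []float64{0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60},
	}, []string{"kind", "mode"}) // kind: "ipfs", "dnslink", "ipns_record"; mode: "single", "recursive"

	// Number of path segments after the root CID in /ipfs/ requests.
	contentPathLength = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "bifrost_ipfs_content_path_length",
		Help:    "Requested content path length (path segments after the root CID).",
		Buckets: prometheus.LinearBuckets(0, 1, 11), // 0..10 segments
	})
)

func init() {
	prometheus.MustRegister(resolveDuration, contentPathLength)
}

// observeResolve times a single resolution step or a full recursive resolve.
func observeResolve(kind, mode string, resolve func() error) error {
	start := time.Now()
	err := resolve()
	resolveDuration.WithLabelValues(kind, mode).Observe(time.Since(start).Seconds())
	return err
}
```

Recording `mode="single"` for every individual lookup and `mode="recursive"` once per request keeps the two comparable, which addresses the skew mentioned in the list above.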
## Appendix: what histograms from go-libipfs/gateway look like
When I say "histogram", I mean the `_sum` and `_bucket` series we use in Kubo's `/debug/metrics/prometheus`:
```
# HELP ipfs_http_gw_raw_block_get_duration_seconds The time to GET an entire raw Block from the gateway.
# TYPE ipfs_http_gw_raw_block_get_duration_seconds histogram
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="0.05"} 927
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="0.1"} 984
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="0.25"} 1062
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="0.5"} 1067
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="1"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="2"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="5"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="10"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="30"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="60"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_bucket{gateway="ipfs",le="+Inf"} 1068
ipfs_http_gw_raw_block_get_duration_seconds_sum{gateway="ipfs"} 19.696413685999993
ipfs_http_gw_raw_block_get_duration_seconds_count{gateway="ipfs"} 1068
```
We can change the bucket distribution if that gives us better data, but it should be done on both ends.
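For completeness, a bucket layout like the one above is just a `[]float64` passed to `HistogramOpts.Buckets` in the Prometheus Go client, so keeping one shared slice on both ends is enough to keep the data comparable. The snippet below is a sketch of that idea, not how go-libipfs/gateway actually declares this metric:

```go
// Hypothetical shared bucket definition; the variable names are assumptions.
package metricsdoc

import "github.com/prometheus/client_golang/prometheus"

// Using the same slice in go-libipfs/gateway and bifrost-gateway is what
// "done on both ends" would mean in practice.
var defaultDurationBuckets = []float64{0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60}

// Namespace + Subsystem + Name combine into the
// ipfs_http_gw_raw_block_get_duration_seconds series shown in the appendix.
var rawBlockGetDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
	Namespace: "ipfs",
	Subsystem: "http",
	Name:      "gw_raw_block_get_duration_seconds",
	Help:      "The time to GET an entire raw Block from the gateway.",
	Buckets:   defaultDurationBuckets,
}, []string{"gateway"})
```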