/routing/v1 http client metrics and configuration

Open lidel opened this issue 1 year ago • 0 comments

Problem

It seems that we have hardcoded some settings related to delegated routing over HTTP:

- HTTP client pool details here
- HTTP router timeout here: https://github.com/ipfs/rainbow/blob/19723fe3c522dba0daa861bf64f02dad30fde7e2/setup.go#L273

A 15s timeout on a cold cache might lead to an unintended denial of service if content is only announced to IPNI at cid.contact and either the client or the server is under load, so that receiving a response takes more than 15s.

Solution

I think we should expose HTTP routing client metrics to see if and when things fail, make things configurable (at least the routing timeout), and use our infra to adjust the default based on real-world performance:

[ ] expose timeout as a configuration setting, allowing us to fine-tune it on ipfs.io infra
- the config option for adjusting the timeout should follow whatever naming convention we end up with in #113
- ipfs.io gateway infra times out (HTTP 504) after ~1m, so I think it would not hurt if we waited for the routing response a bit longer than 15s
[ ] have success/failure metrics for each defined /routing/v1 endpoint
- This needs analysis, but on the surface it looks like we never finished this. There are error-related metrics in boxo/routing/http/client here, but we don't seem to expose routing_http_client_latency on http://127.0.0.1:8091/debug/metrics/prometheus

Apr 08 '24 22:04 lidel