rainbow
/routing/v1 http client metrics and configuration
Problem
It seems we have hardcoded some settings related to delegated routing over HTTP:
- http client pool details here
- http router timeout here https://github.com/ipfs/rainbow/blob/19723fe3c522dba0daa861bf64f02dad30fde7e2/setup.go#L273
A 15s timeout on a cold cache might lead to an undesired denial of service if content is only announced to IPNI at cid.contact and either the client or the server is under load, so that receiving a response takes more than 15s.
Solution
I think we should expose HTTP routing client metrics to see if/when things fail, make things configurable (at least the routing timeout), and use our infra to adjust the default based on real-world performance:
- [ ] expose timeout as a configuration setting, allowing us to fine-tune it on ipfs.io infra
  - the config option for adjusting the timeout should follow whatever naming convention we end up with in #113
  - the ipfs.io gateway infra times out (HTTP 504) after ~1m, so I think it would not hurt to wait for a routing response a bit longer than 15s
- [ ] have success/failure metrics for each defined /routing/v1 endpoint
  - Needs analysis, but on the surface, it looks like we never finished this? There are error-related metrics in boxo/routing/http/client here, but we don't seem to expose `routing_http_client_latency` on http://127.0.0.1:8091/debug/metrics/prometheus