IPFS Infrastructure Status Page

Open olizilla opened this issue 6 years ago • 9 comments

Related to #80, we need a more holistic overview of the health of the ipfs.io infrastructure. We want to visualise how things are running in a way that gives a clear overview at the top level and lets you drill into more info for each specific service, linking out to other telemetry services (netdata, grafana) where sensible for the full details.

A status page of some sort has been suggested. Popular public ones include:

  • https://status.slack.com/
  • https://www.githubstatus.com/
  • https://status.circleci.com/
  • https://status.cloud.google.com/

Some open source solutions

  • https://statusfy.co/

TODO:

  • [ ] Define the list of services
    • gateway nodes
    • bootstrap nodes
    • dhtbooster nodes
    • preload nodes
    • websocket-star and webrtc signalling infra
    • nginx / http frontend
    • certbot / tls
    • DNS / dnsimple
    • ?
  • [ ] Define regions, zones, datacenters
    • packet
    • where tho?
  • [ ] Define metrics
    • Gateway / Nginx requests over time (current gateway load)
    • nginx timeouts over time (# requests for undiscoverable content)
    • IPFS response time for local blocks
    • IPFS response time for blocks from cluster
    • IPFS response time for DHT discovery
    • Estimated unique peerIDs in network
    • total bandwidth and average bandwidth per request.
    • total infra cost?
  • [ ] Define status thresholds
    • Happy fail: requests are slow because we are getting way more than usual
    • Sad fail: requests are slow because something is broken, e.g. DHT discovery time just spiked but the number of unique peers didn't
    • Budget exceeded: we hit a cost threshold and started throttling specific services.
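
To make the three proposed states concrete, here is a minimal sketch of how they could be told apart from a single metrics sample. Everything in it is an assumption for illustration: the field names, the `Thresholds` defaults, and the rule that "way more than usual" means at least `load_factor` times a baseline request rate. None of these values come from real ipfs.io measurements.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OK = "ok"
    HAPPY_FAIL = "happy fail"
    SAD_FAIL = "sad fail"
    BUDGET_EXCEEDED = "budget exceeded"


@dataclass
class Sample:
    """One hypothetical reading of the metrics listed above."""
    requests_per_sec: float
    response_time_ms: float
    monthly_cost_usd: float


@dataclass
class Thresholds:
    """Placeholder values only; real baselines would have to be measured."""
    baseline_requests_per_sec: float = 1000.0
    load_factor: float = 2.0        # "way more than usual" = 2x baseline (assumed)
    slow_response_ms: float = 2000.0
    budget_usd: float = 10000.0


def classify(sample: Sample, t: Thresholds = Thresholds()) -> Status:
    # A cost overrun is reported first, since it triggers throttling.
    if sample.monthly_cost_usd >= t.budget_usd:
        return Status.BUDGET_EXCEEDED
    # Fast responses mean there is nothing to report.
    if sample.response_time_ms < t.slow_response_ms:
        return Status.OK
    # Slow responses: load explains it (happy) or it does not (sad).
    overloaded = sample.requests_per_sec >= t.baseline_requests_per_sec * t.load_factor
    return Status.HAPPY_FAIL if overloaded else Status.SAD_FAIL
```

A fuller version would probably also compare DHT discovery time against the unique peer count, as in the "sad fail" example, rather than relying on request rate alone.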

olizilla avatar Jul 03 '19 14:07 olizilla

Some good status pages

https://www.githubstatus.com

Screenshot 2019-07-10 at 14 06 41

https://status.circleci.com

Screenshot 2019-07-10 at 14 03 48

https://status.slack.com

Screenshot 2019-07-10 at 14 03 18

olizilla avatar Jul 10 '19 13:07 olizilla

Interestingly, GitHub and CircleCI both use https://statuspage.io. I am currently trying out https://docs.statusfy.co

olizilla avatar Jul 10 '19 14:07 olizilla

Here's how things could look if we go for https://statuspage.io

User view

New Incident Resolved Details
Screenshot 2019-07-11 at 13 18 53 Screenshot 2019-07-11 at 13 20 20 Screenshot 2019-07-11 at 13 20 26

Operator view

New incident Resolved Details
Screenshot 2019-07-11 at 13 18 27 Screenshot 2019-07-11 at 13 19 41 Screenshot 2019-07-11 at 13 19 53

All OK

Screenshot 2019-07-11 at 13 17 36

olizilla avatar Jul 11 '19 13:07 olizilla

I really want those health meters to pulse in a Knight Rider kind of way, but otherwise this is really nifty!

jessicaschilling avatar Jul 11 '19 17:07 jessicaschilling

I also tried out:

  • https://www.sorryapp.com/ - cheaper than statuspage.io but didn't feel as intuitive... something about it didn't click for me
New Incident View incident
Screenshot 2019-07-12 at 10 41 45 Screenshot 2019-07-12 at 10 40 39

224b78df sorryapp com_ (1)

olizilla avatar Jul 12 '19 09:07 olizilla

  • https://statusfy.co/ - good self-hosted option - gives you a CLI to create incidents as markdown files. The incident status is tracked in YAML front matter, and the markdown lets you add notes and status updates. See: https://docs.statusfy.co/guide/incidents/#front-matter
Screenshot 2019-07-12 at 10 39 07

The cli builds out a static site, and then it's up to us where we want to publish it.

localhost_8000_

This could let us host it on IPFS, but I'm assuming that the network status page is the one resource we should not post on IPFS itself. We can of course host it on any static resource server. I've not explored it further, as it seems like we'd want a very comfortable and clear UI for reporting incidents, since those situations are stressful enough. Creating a static site is a reliable process, and could be entirely automated via GitHub, but I want to check in with the operators who would be using it to see what their preferences are.
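
If we did automate this via GitHub, a small check could read the front matter of each incident file and report which incidents are still open. The sketch below is only an illustration of the "markdown file with YAML front matter" layout described above. It handles flat `key: value` pairs and skips nested or list entries, and the `title` and `resolved` field names in the example are assumptions that should be checked against the statusfy front-matter docs linked above.

```python
from __future__ import annotations


def split_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split a markdown document into (front matter dict, body).

    Only flat `key: value` lines are parsed; list items and indented
    lines are skipped. Returns ({}, text) if there is no front matter.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    try:
        # Index of the closing `---` delimiter.
        end = next(i for i, line in enumerate(lines[1:], start=1)
                   if line.strip() == "---")
    except StopIteration:
        return {}, text
    meta: dict[str, str] = {}
    for line in lines[1:end]:
        if ":" not in line or line.startswith((" ", "-")):
            continue  # skip nested or list entries
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


# Hypothetical incident file; field names are assumed, not confirmed.
EXAMPLE = """---
title: Gateway slow responses
resolved: false
---
Investigating elevated DHT discovery times.
"""

meta, body = split_front_matter(EXAMPLE)
is_open = meta.get("resolved") == "false"
```

A real integration would more likely use a proper YAML parser, since front matter can contain lists such as the affected systems.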

olizilla avatar Jul 12 '19 09:07 olizilla

This could let us host it on IPFS, but I'm assuming that the network status page is the one resource we should not post on IPFS itself.

😆 Agreed.

jessicaschilling avatar Jul 12 '19 18:07 jessicaschilling

Main storage Cluster is missing from the services list (although I see it in your screenshots).

Also, Pinbots.

hsanjuan avatar Jul 15 '19 16:07 hsanjuan

Both statuspage and statusfy seem reasonable. Are there other benefits to the self-hosted version that we like, e.g. the markdown or CLI integrations?

momack2 avatar Aug 02 '19 01:08 momack2