Add a "liveness probe" for score submissions such that outages don't cause users to be unable to play
Other efforts to increase resiliency have turned out to be either a lot of work or stalled due to circumstances outside our control. A short-term solution to keep the game playable during server outages would be to implement a sort of "liveness probe": an endpoint independent of other infra that is as simple and as resilient to falling over as possible. The client would use it to determine whether things are alive and whether to bother submitting scores or not.
Have not thought of any of the details like:
- where this should live
- how often to query this probe
- whether it should affect just score submission or e.g. instantly knock the entirety of API into `Failing` state (you could call it dubious, but there's very little in the game right now that can work without web. if you want that, then read https://github.com/ppy/osu-server-spectator/issues/367 and https://github.com/ppy/osu-server-spectator/issues/368#issuecomment-3536556034).
cc @peppy
I agree that an initial implementation would knock the whole api into failing mode, because this seems like the simplest way to ensure things work (aka fail) correctly.
Having the probe on either a low tier droplet, or hosted on a completely separate non-cloudflare'd static host and updated via a script on a droplet, seems like the best method to me.
If we go with method two, S3 would probably be the most robust and reliable option, although it would incur a small cost.
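A rough sketch of what the "script on a droplet" half of method two could look like, assuming the status is published as a small JSON file. The field names (`dead`, `message`, `updated_at`) and both function names are made up for illustration, and the actual upload to the static host is deliberately left out:

```python
import json
import os
from datetime import datetime, timezone


def render_status(dead: bool, message: str = "") -> str:
    """Build the static status document (hypothetical JSON shape)."""
    return json.dumps(
        {
            "dead": dead,
            "message": message,
            # Lets consumers see how stale the document is.
            "updated_at": datetime.now(timezone.utc).isoformat(),
        },
        sort_keys=True,
    )


def write_status(path: str, dead: bool, message: str = "") -> None:
    """Write the document locally; uploading it to the host is out of scope."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        f.write(render_status(dead, message))
    # Replace in one step so a reader never sees a half-written file.
    os.replace(tmp_path, path)
```

Keeping the output a plain static file means the host serving it needs no application logic at all, which is the property being asked for here.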
Many such probe servers already exist in the wild to check network status and detect captive portals on wifi hotspots, see https://antonz.org/is-online/
They will certainly be more resilient than whatever we come up with, and we could combine two just in case one of them manages to fail.
It might be good to use the google API as a first point of contact, but it's also nice to have our own endpoint where we can provide more details to the end user about the status of things (as done in the linked PR).
> Many such probe servers already exist in the wild to assert network status and figuring out existence of captive portals on wifi hotspots, see https://antonz.org/is-online/
I don't think this works because the parameters here are different than "is the internet on".
Maybe I should have used "deadness probe" in the OP of this issue, because, as described in https://github.com/ppy/osu/pull/35752, I think the safest move is to have an endpoint that indicates that osu! is currently actively dead. What I want to avoid is any of the following:
- One of these external never-fail endpoints starting to fail programmatically because of `$reason`.
- One of these external never-fail endpoints starting to fail because of `$network_reason` that doesn't apply to our infra (think like routing issues or something, shark eating cable, whatever).
- One of these external never-fail endpoints getting IP-range-banned by `$country` and therefore making the game completely unplayable.
And those above factors completely eliminate the usability of anything third-party to me because at that point there's no signal left. There's nothing to be gleaned from the endpoint succeeding or failing. The signal I desire is us explicitly saying to our users "stuff is deaded, stop traffic until we're able to handle it". And unless that message is delivered wholesale, the probe does nothing.
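To illustrate the "deadness probe" semantics, here is a minimal client-side sketch. It assumes a hypothetical JSON body of the form `{"dead": true, "message": "..."}`; the field names, `PROBE_URL`, and both function names are illustrative and not taken from the linked PR. The idea it tries to show is that only an explicit, well-formed "dead" message changes client behaviour, while timeouts, HTTP errors and garbage are treated as no signal at all:

```python
import json
import urllib.error
import urllib.request
from typing import Optional, Tuple

# Hypothetical URL; where the probe should live is still undecided.
PROBE_URL = "https://probe.example.invalid/status.json"


def is_declared_dead(status_code: Optional[int], body: Optional[bytes]) -> bool:
    """Return True only if the probe explicitly says the service is dead.

    Anything else (unreachable probe, HTTP error, malformed body) is treated
    as "no signal", so the client keeps behaving as usual.
    """
    if status_code != 200 or body is None:
        return False
    try:
        payload = json.loads(body)
    except ValueError:
        # Covers both invalid JSON and undecodable bytes.
        return False
    # Assumed payload shape: {"dead": true, "message": "..."}.
    return isinstance(payload, dict) and payload.get("dead") is True


def fetch_probe(
    url: str = PROBE_URL, timeout: float = 5.0
) -> Tuple[Optional[int], Optional[bytes]]:
    """Fetch the probe; return (status_code, body), or (None, None) on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read()
    except (urllib.error.URLError, OSError, ValueError):
        return None, None
```

A client would call `is_declared_dead(*fetch_probe())` on some interval and only stop traffic when it returns `True`. With this shape, a third party blocking or breaking the probe host cannot by itself make the game unplayable.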