feat(auto-scale-up): add fake online, server status cache, and seamless connection during scale-up
This PR introduces a feature to display a configurable "fake online" MOTD when a backend server is offline and auto-scale-up is enabled. When a status ping is received and the backend is not available, the router responds with a custom MOTD instead of appearing offline. It also "hangs" the user while retrying to connect to the backend until a defined timeout is reached, then closes the connection. Overall, this improves the user experience during backend startup.
Related: #36 #21
Key Changes
For fake online:
- Adds configuration options for enabling/disabling the fake online MOTD and customizing its text.
- Ensures the custom MOTD is only sent if auto-scale-up is enabled for the backend.
For status requests:
- Tries to connect to the backend once with a short timeout
- If the backend is unavailable, returns the fake online MOTD (see the sketch after the Key Changes list).
For connect requests:
- If auto-scaler is enabled, the client connection is kept open ("hangs") while retrying backend connection until a configurable timeout is reached.
- If auto-scaler is disabled, only a single backend connection attempt is made.
- Lays the groundwork for further enhancements, such as caching backend status and dynamic MOTD updates.
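To make the status-request path concrete, here is a minimal sketch under stated assumptions: the helper name statusResponse, the 500 ms dial timeout, and the struct below are hypothetical rather than mc-router's actual types; the JSON just follows the usual server-list-ping status shape.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
	"time"
)

// fakeStatus mirrors the usual Minecraft status JSON; it is illustrative,
// not the type mc-router actually uses.
type fakeStatus struct {
	Version struct {
		Name     string `json:"name"`
		Protocol int    `json:"protocol"`
	} `json:"version"`
	Players struct {
		Max    int `json:"max"`
		Online int `json:"online"`
	} `json:"players"`
	Description struct {
		Text string `json:"text"`
	} `json:"description"`
}

// statusResponse dials the backend once with a short timeout. If it is up,
// nil is returned and the real status should be proxied; if it is down and
// auto-scale-up is enabled, a fake-online payload with the configured MOTD
// is returned instead.
func statusResponse(backendAddr, motd string, autoScaleUp bool) ([]byte, error) {
	conn, err := net.DialTimeout("tcp", backendAddr, 500*time.Millisecond)
	if err == nil {
		conn.Close()
		return nil, nil
	}
	if !autoScaleUp {
		return nil, err // keep today's "appears offline" behavior
	}
	var s fakeStatus
	s.Version.Name = "1.21.5"
	s.Version.Protocol = 770
	s.Description.Text = motd
	return json.Marshal(&s)
}

func main() {
	body, _ := statusResponse("127.0.0.1:25566", "Server is waking up...", true)
	fmt.Println(string(body))
}
```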
Questions
Version Response
Currently, the status response uses:
Version: StatusVersion{
    Name:     "1.21.5",
    Protocol: 770,
}
Should the version and protocol be configurable, or is it fine to hardcode the latest supported version? What version/protocol should we default to for best compatibility?
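If we do make it configurable, it could be as small as two optional settings that default to the current hardcoded values; the flag names below are hypothetical, not existing mc-router options.

```go
package main

import (
	"flag"
	"fmt"
)

// Hypothetical flags; mc-router's real configuration surface may differ.
var (
	fakeOnlineVersion = flag.String("fake-online-version", "1.21.5",
		"version name reported in the fake online status")
	fakeOnlineProtocol = flag.Int("fake-online-protocol", 770,
		"protocol number reported in the fake online status")
)

func main() {
	flag.Parse()
	fmt.Printf("fake status will report %s (protocol %d)\n",
		*fakeOnlineVersion, *fakeOnlineProtocol)
}
```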
Next Steps
- The ultimate goal is to cache backend server status and possibly enhance the MOTD to indicate when a server is sleeping or waking up.
Regarding questions:
Version Response
It would be ideal to ping the backend once at startup and retain its version/protocol; however, in the spirit of incremental PRs I'm fine with it being hardcoded at first. I'll leave it up to you for this initial PR.
Thanks for your input @itzg
It seems that waker is non-nil even with auto-scale-up disabled.
Do you think passing the global autoScaleUp config value as a field in the Connector struct is our best option?
Like the fakeOnline config, but we're starting to pass a lot of individual configuration variables.
Yeah, it now looks like it would be good to bundle up the config parameters to avoid the argument/field growth. Since waker is always non-nil, are you pondering shifting the new fields into that? If so, that sounds good to me.
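For example, a bundle along these lines (field names are illustrative, not the actual mc-router config or Connector) would keep the constructor signature stable as options grow, whether it ends up on the waker or on the Connector itself.

```go
package server

import "time"

// ConnectorConfig groups the new per-feature options so a single value can be
// passed around instead of a growing list of individual fields and arguments.
type ConnectorConfig struct {
	AutoScaleUp          bool
	FakeOnline           bool
	FakeOnlineMOTD       string
	BackendRetryInterval time.Duration
	BackendRetryTimeout  time.Duration
}

// Connector here is a stand-in for the real struct; only the config field is shown.
type Connector struct {
	cfg ConnectorConfig
}

func NewConnector(cfg ConnectorConfig) *Connector {
	return &Connector{cfg: cfg}
}
```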
I'm pretty new to golang. Sorry if the code is a bit messy.
I think my time.Sleep(backendRetryInterval) isn't something I should be doing, as it blocks the whole main thread?
There is also a lot of repeated code, so plenty of room for improvement.
That's totally fine. I'll point out comments where I can and can push changes to the PR if there's any that are too hard to explain.
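On the time.Sleep concern: as long as the retry loop runs in the per-connection goroutine rather than the accept loop, only that one client waits. A rough, cancellable version might look like this (helper name and durations are hypothetical):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"time"
)

// dialWithRetry keeps trying the backend until it answers, the overall timeout
// elapses, or the context is cancelled. Called from the per-connection
// goroutine, it only "hangs" that client, not the whole router.
func dialWithRetry(ctx context.Context, addr string, interval, timeout time.Duration) (net.Conn, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		conn, err := net.DialTimeout("tcp", addr, interval)
		if err == nil {
			return conn, nil
		}
		select {
		case <-ctx.Done():
			return nil, errors.New("backend did not come up before the timeout")
		case <-ticker.C:
			// try again on the next tick
		}
	}
}

func main() {
	conn, err := dialWithRetry(context.Background(), "127.0.0.1:25566", 2*time.Second, 30*time.Second)
	if err != nil {
		fmt.Println("giving up:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```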
Sorry for the delay getting back to your PR. I'm focusing on #405 and then after that's merged can shift to yours.
No problem 😃 I'll work on the refactor of the config in the meantime
The previous PR is now merged, so I can focus on yours once you get the merge conflicts caught up.
Conflicts should be fixed @itzg
Thanks @alexfrs69
@Sammcb since you've been in some this code recently I was wondering if you could help look at this PR also?
@alexfrs69 just noticed the PR is still in draft. Want to mark it ready for review?
Thanks for the reviews @itzg @Sammcb!
I've addressed your requested changes, please take another look 🙂 I haven't tested thoroughly yet, but I noticed an issue:
If a client repeatedly refreshes the server status page, each status request is handled sequentially, causing a backlog and significant response delays. This could potentially lead to a DoS situation. How do you suggest we address this?
On the cache side, I think it would be better to:
- Cache all backends at startup
- On each player status request, refresh the cache only if the TTL has expired
- When a downscale is triggered, cache the status before setting replicas to 0
This should reduce backend overhead. What do you think?
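For the cache itself, a small TTL-guarded map along these lines might be enough; the entry type and key are placeholders for whatever status fields we actually end up caching.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cachedStatus is a placeholder for the per-backend status we would cache
// (MOTD, version, player counts, ...).
type cachedStatus struct {
	MOTD      string
	fetchedAt time.Time
}

type statusCache struct {
	mu  sync.RWMutex
	ttl time.Duration
	m   map[string]cachedStatus // keyed by backend address
}

func newStatusCache(ttl time.Duration) *statusCache {
	return &statusCache{ttl: ttl, m: make(map[string]cachedStatus)}
}

// get returns an entry only if it is still within the TTL.
func (c *statusCache) get(backend string) (cachedStatus, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	s, ok := c.m[backend]
	if !ok || time.Since(s.fetchedAt) > c.ttl {
		return cachedStatus{}, false
	}
	return s, true
}

// put stores or refreshes an entry, e.g. at startup, on a cache miss,
// or right before a downscale sets replicas to 0.
func (c *statusCache) put(backend string, s cachedStatus) {
	s.fetchedAt = time.Now()
	c.mu.Lock()
	c.m[backend] = s
	c.mu.Unlock()
}

func main() {
	cache := newStatusCache(30 * time.Second)
	cache.put("mc-survival:25565", cachedStatus{MOTD: "Survival server"})
	if s, ok := cache.get("mc-survival:25565"); ok {
		fmt.Println("cached MOTD:", s.MOTD)
	}
}
```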
@alexfrs69 So to understand the high level goals: the fake online MOTD (and associated caching) really only comes into play when a mapping for a backend exists, but that backend is not reachable and the user has auto-scale-up enabled?
If that is the case, my understanding is the final goal is to capture information about a backend server's state to display in status requests to the player when that server is down. One case I don't think is solved by this is if all backend servers are scaled down when mc-router starts up. Unless it wakes up all backend servers, it would not be able to get info to show to players in fake statuses. Additionally, there would be similar cases where a user might take down a backend server to make updates (version upgrades, config changes, etc.) and then leave it down until a player tries to connect. In this case, these updates might not match the status shown to players.
I'm wondering what your thoughts would be on instead of trying to get all this info from the server, simply adding a config file the user could supply with fallback status info? Obviously, this could drift from the running server status too and would be up to the user running the servers/router to keep things synced.
Thank you for doing so much extensive testing and research on these changes so far! Just wanted to suggest the config option route as I recently ran into this issue when wanting to setup allow/deny lists for server autoscalers. I realized it would be very difficult to get the information from the servers directly so opted to simply allow users to enter it as a config for the mc-router. I don't mean to intrude on your use case though, just something I thought of as another approach.
So to understand the high level goals: the fake online MOTD (and associated caching) really only comes into play when a mapping for a backend exists, but that backend is not reachable and the user has auto-scale-up enabled?
Exactly
I like the idea! I definitely considered the limitation that mc-router can't get the status if the backend is stopped. Relying only on configuration for status at scale could be a bit heavy to manage, though.
We could accept this limitation for now, or, as you suggested, add support for config files so users can provide fallback status info for backends that aren't in the cache. This way, if a status isn't cached, we can return the configured fallback. It offers flexibility for those who need it, without requiring config management for every backend by default.
Yeah, I think either accepting that as a limitation or adding the config file input could work!
In terms of getting the data dynamically, I really like this approach (made some slight modifications to your initial suggestion just to clarify):
- Cache all (available) backends at startup
- On each player status request, refresh the cache only if the TTL has expired or an entry in the cache does not exist
- When a downscale is triggered, cache the status before setting replicas to 0
I can work on implementing the config files tomorrow 🙂
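For the fallback config, here is a rough idea of what user-supplied entries could look like, keyed by server address; the format and field names are just a strawman, not a final proposal.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FallbackStatus is what a user could declare per backend for when neither the
// live server nor the cache can provide a status. Purely illustrative.
type FallbackStatus struct {
	MOTD        string `json:"motd"`
	VersionName string `json:"versionName"`
	Protocol    int    `json:"protocol"`
	MaxPlayers  int    `json:"maxPlayers"`
}

func main() {
	// Example body of such a config file.
	raw := []byte(`{
	  "mc.example.com": {"motd": "Sleeping - join to wake me up", "versionName": "1.21.5", "protocol": 770, "maxPlayers": 20}
	}`)

	var fallbacks map[string]FallbackStatus
	if err := json.Unmarshal(raw, &fallbacks); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", fallbacks["mc.example.com"])
}
```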
It would also be great to automatically detect any new backends added via k8s labels and cache their status right away. Likewise, if a backend is deleted from the mappings, we should remove its cache entry as well.
If you hook that into the router CreateMapping, etc. then that would be totally doable! (I did something very similar for the down scaler in #405)
If a client repeatedly refreshes the server status page, each status request is handled sequentially, causing a backlog and significant response delays. This could potentially lead to a DoS situation. How do you suggest we address this?
That's somewhat by design, since Connector.acceptConnections runs its accept loop in a single goroutine. I couldn't find in the Go stdlib docs where the accept queue length could be specified, but in any case I believe the OS layer should constrain the pending connections to accept, so I think it's safe enough not to handle this explicitly.
Hey guys, just letting you know that I don't have time to invest in this PR this week. I'll surely work on it next week 🙂
Hi everyone. I hate to be that guy, but I was wondering about the state of this PR? Do you still intend to finish this feature @alexfrs69? I'd love to see this implemented. I recently moved from a docker-compose setup with lazymc, which showed a nice motd and kept the user hanging while the server was starting up, over to Kubernetes, and the experience of the server showing up as offline and players getting kicked immediately while the server was starting up led to a lot of confusion amongst my friends. Thank you for all the effort you all have put in so far!
Perhaps we need to scale back the features and behavior to simply what @L-Wehmschulte said:
showed a nice motd and kept the user hanging
I had forgotten a lot of this enhancement and from the comments it looks like we were getting into more complex scenarios, like caching backend status info.
I'll look at this more this weekend myself to refresh my memory.
That would be awesome. Thank you so much!
Hi everyone, sorry I haven’t been able to finish this PR yet. I’ll try to get back to it whenever I have some free time, maybe this weekend.
Most of the features are working fine, but it still needs some polishing, especially around implementing the backend status config and the status cache according to our plan, and thoroughly testing everything (Minecraft timeouts, status protocol, etc.).