feat(auto-scale-up): add fake online, server status cache, and seamless connection during scale-up
This PR introduces a feature to display a configurable "fake online" MOTD when a backend server is offline and auto-scale-up is enabled. When a status ping is received and the backend is not available, the router responds with a custom MOTD instead of appearing offline. It also "hangs" the user while retrying to connect to the backend until a defined timeout is reached, then closes the connection. Overall, this improves the user experience during backend startup.
Related: #36 #21
Key Changes
For fake online:
- Adds configuration options for enabling/disabling the fake online MOTD and customizing its text.
- Ensures the custom MOTD is only sent if auto-scale-up is enabled for the backend.
For status requests:
- Tries to connect to the backend once with a short timeout
- If the backend is unavailable, returns the fake online MOTD (see the sketch after the Key Changes list).
For connect requests:
- If auto-scaler is enabled, the client connection is kept open ("hangs") while retrying backend connection until a configurable timeout is reached.
- If auto-scaler is disabled, only a single backend connection attempt is made.
- Lays the groundwork for further enhancements, such as caching backend status and dynamic MOTD updates.
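To make the status-request path concrete, here is a minimal sketch under stated assumptions: the helper name statusResponse, the 500 ms dial timeout, and the struct below are hypothetical rather than mc-router's actual types; the JSON just follows the usual server-list-ping status shape.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net"
	"time"
)

// fakeStatus mirrors the usual Minecraft status JSON; it is illustrative,
// not the type mc-router actually uses.
type fakeStatus struct {
	Version struct {
		Name     string `json:"name"`
		Protocol int    `json:"protocol"`
	} `json:"version"`
	Players struct {
		Max    int `json:"max"`
		Online int `json:"online"`
	} `json:"players"`
	Description struct {
		Text string `json:"text"`
	} `json:"description"`
}

// statusResponse dials the backend once with a short timeout. If it is up,
// nil is returned and the real status should be proxied; if it is down and
// auto-scale-up is enabled, a fake-online payload with the configured MOTD
// is returned instead.
func statusResponse(backendAddr, motd string, autoScaleUp bool) ([]byte, error) {
	conn, err := net.DialTimeout("tcp", backendAddr, 500*time.Millisecond)
	if err == nil {
		conn.Close()
		return nil, nil
	}
	if !autoScaleUp {
		return nil, err // keep today's "appears offline" behavior
	}
	var s fakeStatus
	s.Version.Name = "1.21.5"
	s.Version.Protocol = 770
	s.Description.Text = motd
	return json.Marshal(&s)
}

func main() {
	body, _ := statusResponse("127.0.0.1:25566", "Server is waking up...", true)
	fmt.Println(string(body))
}
```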
Questions
Version Response
Currently, the status response uses:
Version: StatusVersion{
    Name:     "1.21.5",
    Protocol: 770,
}
Should the version and protocol be configurable, or is it fine to hardcode the latest supported version? What version/protocol should we default to for best compatibility?
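If we do make it configurable, it could be as small as two optional settings that default to the current hardcoded values; the flag names below are hypothetical, not existing mc-router options.

```go
package main

import (
	"flag"
	"fmt"
)

// Hypothetical flags; mc-router's real configuration surface may differ.
var (
	fakeOnlineVersion = flag.String("fake-online-version", "1.21.5",
		"version name reported in the fake online status")
	fakeOnlineProtocol = flag.Int("fake-online-protocol", 770,
		"protocol number reported in the fake online status")
)

func main() {
	flag.Parse()
	fmt.Printf("fake status will report %s (protocol %d)\n",
		*fakeOnlineVersion, *fakeOnlineProtocol)
}
```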
Next Steps
- The ultimate goal is to cache backend server status and possibly enhance the MOTD to indicate when a server is sleeping or waking up.
Regarding questions:
Version Response
It would be ideal to ping the backend once at startup and retain its version/protocol; however, in the spirit of incremental PRs I'm fine with it being hardcoded at first. I'll leave it up to you for this initial PR.
Thanks for your input @itzg
It seems that waker is non-nil even with auto-scale-up disabled.
Do you think passing the global autoScaleUp config value as a field in the Connector struct is our best option?
Like the fakeOnline config, but we're starting to pass a lot of individual configuration variables.
Yeah, it now looks like it would be good to bundle up the config parameters to avoid the argument/field growth. Since waker is always non-nil, are you pondering shifting the new fields into that? If so, that sounds good to me.
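For example, a bundle along these lines (field names are illustrative, not the actual mc-router config or Connector) would keep the constructor signature stable as options grow, whether it ends up on the waker or on the Connector itself.

```go
package server

import "time"

// ConnectorConfig groups the new per-feature options so a single value can be
// passed around instead of a growing list of individual fields and arguments.
type ConnectorConfig struct {
	AutoScaleUp          bool
	FakeOnline           bool
	FakeOnlineMOTD       string
	BackendRetryInterval time.Duration
	BackendRetryTimeout  time.Duration
}

// Connector here is a stand-in for the real struct; only the config field is shown.
type Connector struct {
	cfg ConnectorConfig
}

func NewConnector(cfg ConnectorConfig) *Connector {
	return &Connector{cfg: cfg}
}
```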
I'm pretty new to golang. Sorry if the code is a bit messy.
I think my time.Sleep(backendRetryInterval) isn't something I should be doing, as it blocks the whole main thread?
There is also a lot of repeated code, so plenty of room for improvement.
That's totally fine. I'll point out comments where I can and can push changes to the PR if there's any that are too hard to explain.
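On the time.Sleep concern: as long as the retry loop runs in the per-connection goroutine rather than the accept loop, only that one client waits. A rough, cancellable version might look like this (helper name and durations are hypothetical):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"time"
)

// dialWithRetry keeps trying the backend until it answers, the overall timeout
// elapses, or the context is cancelled. Called from the per-connection
// goroutine, it only "hangs" that client, not the whole router.
func dialWithRetry(ctx context.Context, addr string, interval, timeout time.Duration) (net.Conn, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		conn, err := net.DialTimeout("tcp", addr, interval)
		if err == nil {
			return conn, nil
		}
		select {
		case <-ctx.Done():
			return nil, errors.New("backend did not come up before the timeout")
		case <-ticker.C:
			// try again on the next tick
		}
	}
}

func main() {
	conn, err := dialWithRetry(context.Background(), "127.0.0.1:25566", 2*time.Second, 30*time.Second)
	if err != nil {
		fmt.Println("giving up:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```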
Sorry for the delay getting back to your PR. I'm focusing on #405 and then after that's merged can shift to yours.
No problem 😃 I'll work on the refactor of the config in the meantime
The previous PR is now merged, so I can focus on yours once you get the merge conflicts caught up.
Conflicts should be fixed @itzg
Thanks @alexfrs69
@Sammcb since you've been in some this code recently I was wondering if you could help look at this PR also?
@alexfrs69 just noticed the PR is still in draft. Want to mark it ready for review?
Thanks for the reviews @itzg @Sammcb!
I've addressed your requested changes, please take another look 🙂 I haven't tested thoroughly yet, but I noticed an issue:
If a client repeatedly refreshes the server status page, each status request is handled sequentially, causing a backlog and significant response delays. This could potentially lead to a DoS situation. How do you suggest we address this?
On the cache side, I think it would be better to:
- Cache all backends at startup
- On each player status request, refresh the cache only if the TTL has expired
- When a downscale is triggered, cache the status before setting replicas to 0
This should reduce backend overhead. What do you think?
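For the cache itself, a small TTL-guarded map along these lines might be enough; the entry type and key are placeholders for whatever status fields we actually end up caching.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cachedStatus is a placeholder for the per-backend status we would cache
// (MOTD, version, player counts, ...).
type cachedStatus struct {
	MOTD      string
	fetchedAt time.Time
}

type statusCache struct {
	mu  sync.RWMutex
	ttl time.Duration
	m   map[string]cachedStatus // keyed by backend address
}

func newStatusCache(ttl time.Duration) *statusCache {
	return &statusCache{ttl: ttl, m: make(map[string]cachedStatus)}
}

// get returns an entry only if it is still within the TTL.
func (c *statusCache) get(backend string) (cachedStatus, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	s, ok := c.m[backend]
	if !ok || time.Since(s.fetchedAt) > c.ttl {
		return cachedStatus{}, false
	}
	return s, true
}

// put stores or refreshes an entry, e.g. at startup, on a cache miss,
// or right before a downscale sets replicas to 0.
func (c *statusCache) put(backend string, s cachedStatus) {
	s.fetchedAt = time.Now()
	c.mu.Lock()
	c.m[backend] = s
	c.mu.Unlock()
}

func main() {
	cache := newStatusCache(30 * time.Second)
	cache.put("mc-survival:25565", cachedStatus{MOTD: "Survival server"})
	if s, ok := cache.get("mc-survival:25565"); ok {
		fmt.Println("cached MOTD:", s.MOTD)
	}
}
```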
@alexfrs69 So to understand the high level goals: the fake online MOTD (and associated caching) really only comes into play when a mapping for a backend exists, but that backend is not reachable and the user has auto-scale-up enabled?
If that is the case, my understanding is the final goal is to capture information about a backend server's state to display in status requests to the player when that server is down. One case I don't think is solved by this is if all backend servers are scaled down when mc-router starts up. Unless it wakes up all backend servers, it would not be able to get info to show to players in fake statuses. Additionally, there would be similar cases where a user might take down a backend server to make updates (version upgrades, config changes, etc.) and then leave it down until a player tries to connect. In this case, these updates might not match the status shown to players.
I'm wondering what your thoughts would be on instead of trying to get all this info from the server, simply adding a config file the user could supply with fallback status info? Obviously, this could drift from the running server status too and would be up to the user running the servers/router to keep things synced.
Thank you for doing so much extensive testing and research on these changes so far! Just wanted to suggest the config option route as I recently ran into this issue when wanting to setup allow/deny lists for server autoscalers. I realized it would be very difficult to get the information from the servers directly so opted to simply allow users to enter it as a config for the mc-router. I don't mean to intrude on your use case though, just something I thought of as another approach.
So to understand the high level goals: the fake online MOTD (and associated caching) really only comes into play when a mapping for a backend exists, but that backend is not reachable and the user has auto-scale-up enabled?
Exactly
I like the idea! I definitely considered the limitation that mc-router can't get the status if the backend is stopped. Relying only on configuration for status at scale could be a bit heavy to manage, though.
We could accept this limitation for now, or, as you suggested, add support for config files so users can provide fallback status info for backends that aren't in the cache. This way, if a status isn't cached, we can return the configured fallback. It offers flexibility for those who need it, without requiring config management for every backend by default.
Yeah, I think either accepting that as a limitation or adding the config file input could work!
In terms of getting the data dynamically, I really like this approach (made some slight modifications to your initial suggestion just to clarify):
- Cache all (available) backends at startup
- On each player status request, refresh the cache only if the TTL has expired or an entry in the cache does not exist
- When a downscale is triggered, cache the status before setting replicas to 0
I can work on implementing the config files tomorrow 🙂
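For the fallback config, here is a rough idea of what user-supplied entries could look like, keyed by server address; the format and field names are just a strawman, not a final proposal.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FallbackStatus is what a user could declare per backend for when neither the
// live server nor the cache can provide a status. Purely illustrative.
type FallbackStatus struct {
	MOTD        string `json:"motd"`
	VersionName string `json:"versionName"`
	Protocol    int    `json:"protocol"`
	MaxPlayers  int    `json:"maxPlayers"`
}

func main() {
	// Example body of such a config file.
	raw := []byte(`{
	  "mc.example.com": {"motd": "Sleeping - join to wake me up", "versionName": "1.21.5", "protocol": 770, "maxPlayers": 20}
	}`)

	var fallbacks map[string]FallbackStatus
	if err := json.Unmarshal(raw, &fallbacks); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", fallbacks["mc.example.com"])
}
```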
It would also be great to automatically detect any new backends added via k8s labels and cache their status right away. Likewise, if a backend is deleted from the mappings, we should remove its cache entry as well.
If you hook that into the router CreateMapping, etc. then that would be totally doable! (I did something very similar for the down scaler in #405)
If a client repeatedly refreshes the server status page, each status request is handled sequentially, causing a backlog and significant response delays. This could potentially lead to a DoS situation. How do you suggest we address this?
That's somewhat by design, since Connector.acceptConnections runs its accept loop in a single goroutine. I couldn't find in the Go stdlib docs where the accept queue length could be specified, but in any case I believe the OS layer should constrain the pending connections to accept, so I think it's safe enough not to handle this explicitly.
Hey guys, just letting you know that I don't have time to invest in this PR this week. I'll surely work on it next week 🙂
Hi everyone. I hate to be that guy, but I was wondering about the state of this PR? Do you still intend to finish this feature @alexfrs69? I'd love to see this implemented. I recently moved from a docker-compose setup with lazymc, which showed a nice motd and kept the user hanging while the server was starting up, over to Kubernetes, and the experience of the server showing up as offline and players getting kicked immediately while the server was starting up led to a lot of confusion amongst my friends. Thank you for all the effort you all have put in so far!
Perhaps we need to scale back the features and behavior to simply what @L-Wehmschulte said:
showed a nice motd and kept the user hanging
I had forgotten a lot of this enhancement and from the comments it looks like we were getting into more complex scenarios, like caching backend status info.
I'll look at this more this weekend myself to refresh my memory.
That would be awesome. Thank you so much!
Hi everyone, sorry I haven’t been able to finish this PR yet. I’ll try to get back to it whenever I have some free time, maybe this weekend.
Most of the features are working fine, but it still needs some polishing, especially around implementing the backend status config and the status cache according to our plan, and thoroughly testing everything (Minecraft timeouts, status protocol, etc.).