[Enhancement] Performance improvements for startup and chunk/world loading
Timings or Profile link
https://example.org/not-a-timing-issue
Description of issue
Just gonna leave this here (for Paper 1.19.2): on hosts with 5 or fewer CPUs, this speeds up server startup by ~4x. For me, in a 1-CPU (2.6 GHz) VM, startup went from 32 s (3 pre-generated worlds, no plugins) to ~8 s. Loading worlds with Multiverse is sped up by roughly 60x: on a server with 8 cores, startup time for 100 worlds was ~16 min; now it is 16 s. Servers with higher CPU counts still see good results, and chunk loading is now surprisingly fast as well.
This took me 10 hours of debugging today. I found out that Paper's startup time increases by ~4x when the CPU count in a VM is reduced below 6. After inspecting the code, I tried slightly increasing the scheduler threads, and it worked like a charm instantly. During world loading on 1.19.2, multiple worlds loaded by Multiverse (Bukkit.createWorld(...)) block each other consecutively (and for some reason the time increases exponentially); chunk loading for players also becomes slow. By assigning a new thread pool with 3 threads per world, the Linux scheduler can take care of the scheduling, threads are not spammed too much (300 threads for 100 worlds is not such a big deal), and every bit of performance can go where the players are. It works really well for us so far; performance is as good as pre-1.18. So I thought I'd share.
Patch: https://paste.gg/c3b387e7b19842b2afa94457efb27c19
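The core idea, as a minimal sketch (the linked paste has the real patch; the class name and thread-pool parameters below are illustrative, assuming a plain `ThreadPoolExecutor` per world):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch: one small fixed-size pool per world, so worlds can
// no longer block each other through a shared queue and the OS scheduler
// balances the load. 3 threads per world; 100 worlds -> 300 threads.
public final class PerWorldWorkerPool {
    private static final int THREADS_PER_WORLD = 3;

    public static ThreadPoolExecutor create(String worldName) {
        AtomicInteger id = new AtomicInteger();
        ThreadFactory factory = task -> {
            Thread t = new Thread(task, "Worker-" + worldName + "-" + id.incrementAndGet());
            t.setDaemon(true);
            return t;
        };
        // core == max, keepAlive == 0L: allocate the full thread count once
        // and never shrink, instead of paying thread-creation costs at runtime.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                THREADS_PER_WORLD, THREADS_PER_WORLD,
                0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(), factory);
        pool.prestartAllCoreThreads();
        return pool;
    }
}
```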
Plugin and Datapack List
None
Server config files
Default
Paper version
All 1.19.2 (version is private fork)
Other
No response
It is obviously not meant to be merged directly, but with some configuration options for the user it could be a good patch for the main Paper repo. I would recommend the same defaults for everyone; they achieved the best results I have tested so far, both on 8 cores with 8 GB RAM and on 1 core with 1500 MB RAM.
Why would you have a keepalive of 0L in your thread pools?
You're also creating the pool already starting at the max pool size.
Yes, to avoid reallocating threads at runtime. The keepalive of 0L just means that I want to keep the exact number of threads; I do not really care about reclaiming idle threads, since thread allocation takes far longer than just keeping threads in an idle state. Also, on server startup it is good if as many threads as possible (to a certain degree, of course) participate in chunk loading; a server with many (>=8) CPU threads can possibly get more out of this than it would with 4 worker threads. I do not really care about server load in my environment, as the Minecraft server is the only userspace process running (VM in Kubernetes). On a low (<=2) CPU-thread server, the load can be distributed more evenly among the worlds simply by letting the Linux CFS handle it. Without these threads, one world can effectively block another world from loading (by simply blocking or spamming the queue), even though the two are technically independent in the source.
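For reference: with `corePoolSize == maximumPoolSize` the keep-alive value is never consulted for core threads, and letting core threads time out would even be rejected at 0L. A small self-contained demonstration (names are illustrative):

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public final class KeepAliveDemo {
    public static void main(String[] args) {
        // core == max and keepAlive == 0L: the pool keeps exactly 3 threads,
        // and the keep-alive value never applies to them.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                3, 3, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        pool.prestartAllCoreThreads(); // pay the thread-creation cost once, up front

        // Reclaiming idle core threads would need allowCoreThreadTimeOut(true),
        // which requires keepAlive > 0 (it throws IllegalArgumentException at 0L).
        System.out.println("pool size: " + pool.getPoolSize()); // prints 3
        pool.shutdown();
    }
}
```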
I've not been fond of the maths there for a good while, but there are many dozen caveats with the entire system and people's needs, so there is no one-size-fits-all solution. Ideally one would abstract this out into different strategies for different needs (see the sketch below). "Throw all the threads at it" is not a tenable solution for shared hosts, for example; yet here we generally harm everybody's performance because we need to cater for those environments, which of course creates a huge arse headache.
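Purely as a hypothetical illustration of that strategy idea (none of these types exist in Paper, and the formulas are made up):

```java
// Hypothetical sketch: choose a thread-allocation policy per environment
// instead of one formula that has to serve everyone.
public interface WorkerAllocationStrategy {
    int workerThreadsFor(int availableCpus, int worldCount);
}

// "Throw all the threads at it": plausible for dedicated hosts.
final class DedicatedHostStrategy implements WorkerAllocationStrategy {
    public int workerThreadsFor(int availableCpus, int worldCount) {
        return Math.max(availableCpus, worldCount * 3);
    }
}

// Conservative policy for shared hosts, where hogging cores harms neighbours.
final class SharedHostStrategy implements WorkerAllocationStrategy {
    public int workerThreadsFor(int availableCpus, int worldCount) {
        return Math.max(1, availableCpus / 2);
    }
}
```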
I see that at the end you are replacing the ServerLevel executor. Isn't that a little dangerous if plugins listen to events fired from chunk-load callbacks, like ChunkLoadEvent, since the event would then not be called synchronously?
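For illustration, one conceivable way to keep event dispatch synchronous would be an executor wrapper that trampolines event-firing work back to the main thread; this is a hypothetical sketch using the Bukkit scheduler API, not something the linked patch does:

```java
import java.util.concurrent.Executor;
import org.bukkit.Bukkit;
import org.bukkit.plugin.Plugin;

// Hypothetical sketch: run anything that may fire Bukkit events (such as
// ChunkLoadEvent) on the main server thread instead of a worker thread.
public final class MainThreadTrampoline implements Executor {
    private final Plugin plugin;

    public MainThreadTrampoline(Plugin plugin) {
        this.plugin = plugin;
    }

    @Override
    public void execute(Runnable task) {
        if (Bukkit.isPrimaryThread()) {
            task.run(); // already on the main thread, run inline
        } else {
            Bukkit.getScheduler().runTask(plugin, task); // re-queue onto the main thread
        }
    }
}
```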
This has (most likely) been indirectly resolved by the complete rewrite of the chunk system.
However, if you still think improvements could be made in a certain area even with the new system, please feel free to comment and this issue can be reopened.
Thank you! 😄
@Owen1212055 Interestingly, even though chunk-load performance has increased a lot with the rewrite, this patch speeds it up even further. We are still getting huge performance benefits from it.