Panel will slow down drastically if one node is down
Current Behavior
If one of the nodes goes offline, the panel will slow down. Additionally, it will start throwing 502 Bad Gateway errors every now and then.
Expected Behavior
Once the panel determines that a node is down, it should stop trying to reach it, since the repeated attempts cause consistent website errors and poor performance.
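For illustration only, and not the panel's actual code (the function name, cache key, and endpoint path below are made up): one possible shape of that behavior is to cache a short-lived "node unreachable" flag after the first failed request and skip further calls to that node until the flag expires.

```php
<?php
// Hypothetical sketch, not Pterodactyl's implementation: remember that a node
// failed recently and short-circuit further requests to it for a short window,
// instead of paying a full connection timeout on every page load.

use GuzzleHttp\Client;
use GuzzleHttp\Exception\TransferException;
use Illuminate\Support\Facades\Cache;

function fetchNodeStats(int $nodeId, string $baseUri): ?array
{
    $offlineKey = "node.{$nodeId}.offline"; // made-up cache key

    if (Cache::has($offlineKey)) {
        return null; // node was recently unreachable; do not retry yet
    }

    try {
        $response = (new Client(['base_uri' => $baseUri, 'timeout' => 3]))
            ->get('/api/system'); // illustrative path only

        return json_decode($response->getBody()->getContents(), true);
    } catch (TransferException $e) {
        Cache::put($offlineKey, true, now()->addSeconds(30)); // back off for 30 s

        return null;
    }
}
```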
Steps to Reproduce
- Take one Wings instance offline.
- Preferably, power off the machine it runs on entirely or disconnect it from the internet.
- Try using the panel.
(I'm using Cloudflare.)
Panel Version
1.7.0
Wings Version
1.6.0
Games and/or Eggs Affected
No response
Docker Image
No response
Error Logs
https://bin.ptdl.co/q1jj5/
Is there an existing issue for this?
- [X] I have searched the existing issues before opening this issue.
- [X] I have provided all relevant details, including the specific game and Docker images I am using if this issue is related to running a server.
- [X] I have checked in the Discord server and believe this is a bug with the software, and not a configuration issue with my specific system.
Are you using Cloudflare on the Panel? Could be an explanation for the 502 errors you're receiving.
This is easy to reproduce. If I have a panel running (behind Cloudflare) with one node up and one node down, I will get 504 errors on 50% of refreshes to any panel URL.
I personally experienced this on 0.7 but have not tried to replicate it on 1.x. Based on reports in Discord (and here) I'm fairly sure it still exists.
> Are you using Cloudflare on the Panel? Could be an explanation for the 502 errors you're receiving.

I am; I forgot to mention it. I've updated my initial comment.
Are you sure this isn't Cloudflare just assuming it is down because of the 504s? I explicitly return a 504 from the API endpoints that reach out to Wings, so if Cloudflare is seeing that and calling the whole site down as a result, that's an issue with Cloudflare.

> 504 Gateway Timeout: The server, while acting as a gateway or proxy, did not receive a timely response from an upstream server it needed to access in order to complete the request.
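As a rough, assumed sketch of that pattern (not the panel's exact code; the Guzzle usage and route below are illustrative), the endpoint that proxies to Wings catches the connection failure and turns it into a 504 for that single request instead of letting it hang:

```php
<?php
// Assumed-for-illustration sketch of "return a 504 from the endpoint that
// reaches out to Wings"; the route and names are not the real ones.

use GuzzleHttp\Client;
use GuzzleHttp\Exception\ConnectException;
use Illuminate\Http\JsonResponse;

function serverResources(Client $wings, string $uuid): JsonResponse
{
    try {
        $res = $wings->get("/api/servers/{$uuid}"); // illustrative Wings path

        return new JsonResponse(json_decode($res->getBody()->getContents(), true));
    } catch (ConnectException $e) {
        // Wings never answered, so only this call reports a gateway timeout.
        return new JsonResponse(['error' => 'Daemon is unreachable.'], 504);
    }
}
```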
> Are you sure this isn't Cloudflare just assuming it is down because of the 504s? I explicitly return a 504 from the API endpoints that reach out to Wings, so if Cloudflare is seeing that and calling the whole site down as a result, that's an issue with Cloudflare.
You can't just say "it's an issue with Cloudflare". Cloudflare is used by millions of websites, and dismissing the problem as if it were nothing is not a great option. On top of the periodic error pages, the panel can also freeze or become extremely slow, and some servers may keep loading forever, even if they are not on that node.
I'm sure there is a solution to this issue that would work with Cloudflare and any other CDN.
This might be related to https://github.com/pterodactyl/panel/commit/4964d294f6b23208a00a9f153b616632dd5acebd. This was a commit from 0.7 though so I do not know if the code is still present in the latest version.
I'd say it's somewhere in `/var/www/pterodactyl/app/Repositories/Wings/DaemonServerRepository.php` (the suspension function), likely due to an undefined timeout in the GuzzleHttp client. Adding a timeout did help a little, but more work needs to be done; it only made the panel somewhat smoother while a node was down.
Can you expand on what you mean about an undefined timeout? They're set here: https://github.com/pterodactyl/panel/blob/develop/app/Repositories/Wings/DaemonRepository.php#L72-L73
Can you please also include far more details about "panel slows down", because there isn't much I can do to debug that sentence or guide you towards better resources/ideas to try.
> Can you expand on what you mean about an undefined timeout? They're set here: https://github.com/pterodactyl/panel/blob/develop/app/Repositories/Wings/DaemonRepository.php#L72-L73
Oh, I didn't see that, but it explains the issue. The default timeout from config is 30 seconds. I had success even with 1.5 seconds, though I didn't test it in production; such a small timeout might cause issues of its own. I had implemented a 1.5-second timeout in the `DaemonServerRepository` suspend function without realizing there was already one in `getHttpClient()`. Nonetheless, it made the panel a little smoother while a node was down, but it was still slow and unresponsive, with periodic 502 errors popping up.
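For reference, this is roughly how Guzzle's two timeout knobs and a per-request override look; the numbers and the suspension path below are illustrative, not the panel's shipped defaults:

```php
<?php
// Illustrative Guzzle timeout options; the values are examples only.

use GuzzleHttp\Client;

$client = new Client([
    'base_uri'        => 'https://node.example.com:8080',
    'connect_timeout' => 5,  // seconds allowed to establish the TCP connection
    'timeout'         => 15, // seconds allowed for the entire request
]);

// A single call that is allowed to fail fast can pass a tighter override,
// e.g. the suspension notification discussed above (the path is made up here).
$client->post('/api/servers/some-uuid/suspension', ['timeout' => 1.5]);
```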
> Can you please also include far more details about "panel slows down", because there isn't much I can do to debug that sentence or guide you towards better resources/ideas to try.
Alright, so here's the video: https://youtu.be/YEnfterMMx8
- At 0:26 we stopped Wings.
- At the same time, we applied the rule `iptables -A INPUT -p tcp --dport 8080 -j DROP` to simulate the machine being completely down. If the host system were still reachable, we would only get 502s and no slowdowns.
- At 0:37 I attempted a refresh of the page and experienced an extraordinarily long wait until it eventually loaded.
- I attempted that a few more times afterwards, with the same result.
I've tried to reproduce this and cannot; I really don't know what to tell you here.
I have my entire node offline, and everything continues working exactly as I expect it should, minus the `/resources` API calls that return the expected 504 errors from the API, but that is simply a front-end issue.
I believe this might be an issue with your php-fpm config. I can reproduce similar results only when flooding the API endpoint with a ton of requests at once (100+ parallel on my dev environment) and then trying to do things, in which case PHP is busy working on fetching the API results.
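For context, and with purely illustrative values: the php-fpm pool settings below cap how many PHP workers can run at once, so if every worker is stuck waiting on an unreachable node, new page loads queue behind them, which would look exactly like "the panel slows down".

```ini
; Example php-fpm pool settings (e.g. www.conf); defaults differ per distro.
; pm.max_children is the hard cap on simultaneous PHP workers.
pm = dynamic
pm.max_children = 50
pm.start_servers = 10
pm.min_spare_servers = 5
pm.max_spare_servers = 20
; Reclaim workers that are stuck on a slow upstream call.
request_terminate_timeout = 60s
```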
> I've tried to reproduce this and cannot; I really don't know what to tell you here. [...] I believe this might be an issue with your php-fpm config.
Did you connect it to Cloudflare first?
I'm pretty sure it's not due to Cloudflare config. Every time I've installed this panel on any machine, I've gotten the same behavior. The config is the default one. We're currently running the panel as the only software on a dedicated OVH Rise bare-metal server with 32 GB of RAM and a 4-core/8-thread CPU.
Can confirm that I have run into similar problems as @OpenSource03. I also use Cloudflare. When Wings is unresponsive for one node, the panel becomes really slow.
My current scenario with long loading times looks like the OP's, but only to some extent.
I tend to host my instances of the Panel and Wings using the latest Docker images, and things tend to work flawlessly, but something has changed somewhere that has introduced this slow page loading.
I first noticed the speed issue in a fresh installation of Panel v1.7.0. To troubleshoot, I removed the container and cleared the database to try the image for Panel v1.6.6, but it was the same ordeal there. One thing to note: these new instances of the Panel had not yet had any node attached to them, so "If one of the nodes goes offline, the panel will slow down" doesn't quite apply here, as far as my testing of the Docker variant of the Panel has gone.
When I checked whether it was a bottleneck from high resource usage or an overloaded network, I couldn't find anything out of the ordinary. System resource usage was low, as expected on such a fresh installation, and bandwidth usage was basically zero.
The next step was to see whether the issue persisted if I did the installation natively (i.e. not via Docker), and surprisingly, it was no longer there. Same hardware, same host Linux distro, and same Cloudflare config. Whether or not a node was created, and whether the created node was online or offline, the panel was as responsive as it had been when I hosted Panel v1.6.6 via Docker about a month ago.
Both the OP and I proxy our Panel installations via Cloudflare. I can confirm that my Cloudflare configuration for the domain has not changed in the past few months (except DNS records), so it shouldn't be anything on Cloudflare's end, unless they've slowly rolled out changes to customers without announcing them.
Running the Panel both proxied and as "DNS only" via Cloudflare makes no difference for me; the issue is still there. I tried this both with the Docker installation and with the native installation on the host system.
@DaneEveritt any news on this case?
Based on testing like that provided above, I believe I am having these issues as well. Currently I have one node down (planned) and my panel has slowed to a crawl. Earlier this week I was experiencing the same thing, but once my other node that was down (planned) came back up, the panel worked flawlessly again.
This applies to me. The panel is behind Cloudflare, and the nodes are accessible directly. I don't think it's a Cloudflare issue, though, as I was experiencing this behaviour when I ran the panel without Cloudflare a couple of versions ago.
This happens to me even when nothing is behind Cloudflare, if the offline node has a lot of servers and you've toggled the "view all servers" button.
Any updates on this? I don't have the option of not using Cloudflare due to frequent DDoSing, and this bug makes the panel nigh unusable whenever I forget to renew a cert on one of my boxes.
While I'm not able to answer the specific question, I highly recommend setting up certbot with auto-renewal for your certs so you don't have to renew them manually each time.
Any updates on this?
Same problem here, using Cloudflare too.