
Panel will slow down drastically if one node is down

Open OpenSource03 opened this issue 3 years ago • 22 comments

Current Behavior

If one of the nodes goes offline, the panel slows down. It also starts throwing 502 Bad Gateway errors every now and then.

Expected Behavior

Once the panel determines that a node is down, it should stop trying to reach it, since the repeated attempts cause persistent website errors and poor performance.
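
A minimal sketch of what that could look like, assuming a simple per-node circuit breaker. The class name NodeBreaker, its thresholds, and its methods are hypothetical and not taken from the Panel codebase; Python is used only for illustration:

```python
import time
from typing import Callable, Dict


class NodeBreaker:
    """Hypothetical per-node circuit breaker; not part of the Panel codebase."""

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold  # failures before skipping a node
        self.cooldown = cooldown                    # seconds to skip a failing node
        self.clock = clock                          # injectable for testing
        self._failures: Dict[str, int] = {}
        self._opened_at: Dict[str, float] = {}

    def allow(self, node: str) -> bool:
        """Return True if a request to this node should be attempted."""
        opened = self._opened_at.get(node)
        if opened is None:
            return True
        # Once the cooldown has passed, requests are allowed again;
        # the next recorded failure restarts the cooldown.
        return self.clock() - opened >= self.cooldown

    def record_success(self, node: str) -> None:
        """Forget past failures: the node is considered healthy again."""
        self._failures.pop(node, None)
        self._opened_at.pop(node, None)

    def record_failure(self, node: str) -> None:
        """Count a failure and start skipping the node at the threshold."""
        count = self._failures.get(node, 0) + 1
        self._failures[node] = count
        if count >= self.failure_threshold:
            self._opened_at[node] = self.clock()
```

The caller would check allow() before contacting a node and report the outcome afterwards, so that a node marked as down costs no timeout at all until the cooldown expires.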

Steps to Reproduce

  1. Take one Wings instance offline.
  2. Preferably, power off the machine it runs on completely, or disconnect it from the internet.
  3. Try using the panel

(I’m using CloudFlare)

Panel Version

1.7.0

Wings Version

1.6.0

Games and/or Eggs Affected

No response

Docker Image

No response

Error Logs

https://bin.ptdl.co/q1jj5/

Is there an existing issue for this?

  • [X] I have searched the existing issues before opening this issue.
  • [X] I have provided all relevant details, including the specific game and Docker images I am using if this issue is related to running a server.
  • [X] I have checked in the Discord server and believe this is a bug with the software, and not a configuration issue with my specific system.

OpenSource03 avatar Feb 02 '22 20:02 OpenSource03

Are you using Cloudflare on the Panel? Could be an explanation for the 502 errors you're receiving.

camw0 avatar Feb 02 '22 21:02 camw0

This is easy to reproduce. If I have a panel running (behind cloudflare) with 1 node up and 1 node down I will get 504 errors on 50% of refreshes hitting any panel URL.

I personally experienced this on 0.7 but have not tried to replicate it on 1.x. Based on reports in Discord (and here) I'm fairly sure it still exists.

iamkubi avatar Feb 03 '22 00:02 iamkubi

Are you using Cloudflare on the Panel? Could be an explanation for the 502 errors you're receiving.

I am, forgot to mention. I updated my initial comment.

OpenSource03 avatar Feb 03 '22 06:02 OpenSource03

Are you sure this isn't Cloudflare just assuming it is down because of the 504s? I explicitly return a 504 from the API endpoints that reach out to Wings, so if Cloudflare is seeing that and calling the whole site down as a result, that's an issue with Cloudflare.

504 Gateway Timeout: The server, while acting as a gateway or proxy, did not receive a timely response from an upstream server it needed to access in order to complete the request.

DaneEveritt avatar Feb 03 '22 13:02 DaneEveritt

Are you sure this isn't Cloudflare just assuming it is down because of the 504s? I explicitly return a 504 from the API endpoints that reach out to Wings, so if Cloudflare is seeing that and calling the whole site down as a result, that's an issue with Cloudflare.

504 Gateway Timeout: The server, while acting as a gateway or proxy, did not receive a timely response from an upstream server it needed to access in order to complete the request.

You can't just say "it's an issue with CloudFlare". Cloudflare is used by millions of websites, and just moving forward as if it were nothing is not a great option. On top of the periodic error pages, the panel can also freeze or become extremely slow, and some servers may keep loading forever even when they are not on the offline node.

I'm sure there can be a solution to this issue that would work with CloudFlare and any other CDN whatsoever.

OpenSource03 avatar Feb 03 '22 13:02 OpenSource03

This might be related to https://github.com/pterodactyl/panel/commit/4964d294f6b23208a00a9f153b616632dd5acebd. This was a commit from 0.7 though so I do not know if the code is still present in the latest version.

mrxbox98 avatar Feb 03 '22 20:02 mrxbox98

I'd say it's somewhere in

/var/www/pterodactyl/app/Repositories/Wings/DaemonServerRepository.php (suspension function)

Likely due to undefined timeout in GuzzleHttp Client. Adding a timeout did help a little, but more work needs to be done. This, of course, only made the panel a little smoother while a node was down.

OpenSource03 avatar Feb 03 '22 20:02 OpenSource03

Can you expand on what you mean about an undefined timeout, they're set here: https://github.com/pterodactyl/panel/blob/develop/app/Repositories/Wings/DaemonRepository.php#L72-L73

Can you please also include far more details about "panel slows down", because there isn't much I can do to debug that sentence or guide you towards better resources/ideas to try.

DaneEveritt avatar Feb 03 '22 20:02 DaneEveritt

Can you expand on what you mean about an undefined timeout, they're set here: https://github.com/pterodactyl/panel/blob/develop/app/Repositories/Wings/DaemonRepository.php#L72-L73

Oh, I didn't see that, but it explains the issue. The default timeout from the config is 30. I had success even with 1.5 but didn't test it in production; such a small timeout might cause issues of its own. I implemented a timeout of 1.5 in the DaemonServerRepository suspend function without knowing there was already one in getHttpClient(). Nonetheless, it made the panel a little smoother while a node was down, but the panel was still slower, unresponsive, and had lovely periodic 502 errors popping up.

Can you please also include far more details about "panel slows down", because there isn't much I can do to debug that sentence or guide you towards better resources/ideas to try.

Alright so here's the video: https://youtu.be/YEnfterMMx8

  • At 0:26 we stopped Wings.
  • At the same time, we applied the rule iptables -A INPUT -p tcp --dport 8080 -j DROP to simulate the system being completely down. If the host system stays reachable, we only get 502s and no slowdowns.
  • At 0:37 I refreshed the page and experienced an extraordinarily long wait before it eventually loaded.
  • I tried that a few more times afterwards, with the same result.
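
The difference between a dropped and a refused connection can be sketched as follows. This is an illustrative Python probe, not Panel code, and the helper names probe and closed_local_port are hypothetical. A closed port on a reachable host answers immediately with a refusal, whereas a port behind a DROP rule sends nothing back, so the client waits out its full connect timeout:

```python
import socket
import time
from typing import Tuple


def probe(host: str, port: int, timeout: float) -> Tuple[str, float]:
    """Try a TCP connect and report the outcome plus the time it took."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            status = "connected"
    except ConnectionRefusedError:
        # Host is reachable but nothing listens: the failure is immediate.
        status = "refused"
    except (socket.timeout, TimeoutError):
        # No answer at all (e.g. packets dropped): we waited the full timeout.
        status = "timeout"
    except OSError:
        status = "error"
    return status, time.monotonic() - start


def closed_local_port() -> int:
    """Return a local port number that was just released and is not listening."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    status, elapsed = probe("127.0.0.1", closed_local_port(), timeout=2.0)
    print(f"{status} after {elapsed:.3f}s")
```

With the DROP rule from the list above in place, the same probe against the node's port 8080 would be expected to report a timeout, and only after the full timeout has elapsed.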

OpenSource03 avatar Feb 03 '22 20:02 OpenSource03

I've tried to reproduce this and cannot, I really don't know what to tell you here.

I have my entire node offline, and everything continues working exactly as I expect it should, minus the /resources API calls that return the expected 504 errors from the API but that is simply a front-end issue.

I believe this might be an issue with your php-fpm config. I can reproduce similar results only when flooding the API endpoint with a ton of requests at once (100+ parallel on my dev environment) and then trying to do things, in which case PHP is busy working on fetching the API results.
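
As a rough illustration of that saturation effect, here is a back-of-the-envelope sketch in Python. It applies Little's law (average busy workers = arrival rate × time each request is held). The function names are hypothetical, and the pool size, request rate, and blocking time are made-up inputs, not measurements from this thread:

```python
import math


def blocked_workers(requests_per_second: float, blocked_seconds: float) -> float:
    """Average number of workers held by calls waiting on the offline node.

    Little's law: L = lambda * W.
    """
    return requests_per_second * blocked_seconds


def free_workers(pool_size: int, requests_per_second: float, blocked_seconds: float) -> int:
    """Workers left over for every other page once the blocked calls are counted."""
    busy = math.ceil(blocked_workers(requests_per_second, blocked_seconds))
    return max(0, pool_size - busy)


if __name__ == "__main__":
    # Assumed inputs: 2 requests/s hit the offline node, each blocks for 5 s.
    for pool in (5, 50):
        print(pool, "workers ->", free_workers(pool, 2.0, 5.0), "free")
```

Under these assumed numbers, a small pool has no free workers left, which would make unrelated pages queue behind the calls that are waiting on the offline node.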

DaneEveritt avatar Feb 05 '22 17:02 DaneEveritt

I've tried to reproduce this and cannot, I really don't know what to tell you here.

I have my entire node offline, and everything continues working exactly as I expect it should, minus the /resources API calls that return the expected 504 errors from the API but that is simply a front-end issue.

I believe this might be an issue with your php-fpm config. I can reproduce similar results only when flooding the API endpoint with a ton of requests at once (100+ parallel on my dev environment) and then trying to do things, in which case PHP is busy working on fetching the API results.

Did you connect it to CloudFlare first?

I’m pretty sure it’s not due to Cloudflare config. Every time I ever installed this panel on any machine, I got the same behavior. The config is the default one. We’re currently running the panel as the only software on a whole 32GB 4 core 8 threads OVH rise baremetal.

OpenSource03 avatar Feb 05 '22 18:02 OpenSource03

I can confirm that I have run into problems similar to @OpenSource03's. I also use Cloudflare. When Wings is unresponsive on one node, the panel becomes really slow.

Wepwawetcode avatar Feb 06 '22 18:02 Wepwawetcode

My current scenario with long loading times looks like OP's, but only to some extent.

I tend to host my instances of Panel and Wings using the latest Docker images and things tend to work flawlessly—but something has changed somewhere which has introduced this slow page loading speed.

I first noticed this speed issue in a fresh installation of Panel v1.7.0. To troubleshoot, I went ahead and removed the container and cleared the database to try the image for Panel v1.6.6—but same ordeal there. One thing to note, these new instances of the Panel had yet to see any node attached to them. So, the "If one of nodes go offline, the panel will slow down." doesn't quite apply here as far as my testing has gone for the Docker-variant of the Panel.

When I tried to check if it was a bottleneck by high resource usage or an overload on the network, I couldn't find anything out of the ordinary. Looking at the system resource usage, they were low as expected on such a fresh installation, and the bandwidth usage was basically a zero.

The next step was to see if the issue persisted if I did the installation natively (a.k.a. not via Docker), and surprisingly, the issue was no longer there. Same system hardware, same host Linux distro, and same Cloudflare config. No matter if there was a node created or not, or if the created node was online or offline, the panel was responsive just like I had experienced previously when hosting Panel v1.6.6 via Docker about a month ago.

Both OP and I seem to proxy our Panel installations via Cloudflare. I can confirm that the Cloudflare configuration for my domain has not changed in the past few months (except DNS records), so it shouldn't be anything on Cloudflare's end, unless they've slowly rolled out changes to customers without announcing them.

Running the Panel proxied or as "DNS only" via Cloudflare makes no difference for me; the issue is still there. I tried this both with the Docker installation and with the native installation on the host system itself.

whallin avatar Feb 07 '22 01:02 whallin

@DaneEveritt any news on this case?

OpenSource03 avatar Feb 28 '22 19:02 OpenSource03

I believe based on some testing as provided above, I am having these issues as well. Currently I have one node down (planned) and my panel has come to a crawl. Earlier this week I was experiencing the same thing but after my other node that was down (planned) came back, it worked flawlessly again.

Valiantiam avatar Mar 16 '22 01:03 Valiantiam

This applies to me. Panel is behind Cloudflare. Nodes are accessible directly. Though I don't think that it is Cloudflare's issue as I was experiencing this behaviour when I ran panel w/o CF a couple of versions ago

WGOS avatar Apr 08 '22 22:04 WGOS

This happens to me even when nothing is behind Cloudflare, when the offline node has a lot of servers and you've toggled the "view all servers" button.

babyharpsea avatar May 25 '22 21:05 babyharpsea

Any updates on this? I don't have the option of not using Cloudflare due to frequent DDoSing, and this bug makes ptero nigh unusable whenever I forget to renew a cert on one of my boxes.

sapphonie avatar Feb 15 '23 04:02 sapphonie

Any updates on this? I don't have the option of not using Cloudflare due to frequent DDoSing, and this bug makes ptero nigh unusable whenever I forget to renew a cert on one of my boxes.

While I can't answer the specific question, I highly recommend setting up certbot with automatic renewal for your certs so you don't have to renew them manually each time.

Valiantiam avatar Feb 15 '23 04:02 Valiantiam

Any updates on this?

Wepwawetcode avatar Aug 08 '23 14:08 Wepwawetcode

Same problem here, using Cloudflare too

Maxence1502 avatar Aug 23 '23 17:08 Maxence1502