[BUG] AWS CPU Usage Causes Bot Offline

Open kevinjycui opened this issue 4 years ago • 7 comments

I have yet to diagnose the cause, but roughly once a week the bot uses >90% of the AWS EC2 instance's CPU, which crashes the instance and leaves the bot offline until it is manually rebooted. Nothing is logged, since the bot simply uses too much CPU. Removing HTTP requests seems to lengthen the time between crashes, but doing so removes key features and only delays the crash.

Crashes from last 2 months

  • Sunday 03 January, 2021 15:38:15 UTC; 98.9% CPU Usage
  • Thursday 07 January, 2021 17:23:15 UTC; 99.2% CPU Usage
  • Monday 18 January, 2021 10:03:15 UTC; 92.4% CPU Usage
  • Monday 25 January, 2021 14:18:15 UTC; 99.2% CPU Usage
  • Wednesday 03 February, 2021 08:08:15 UTC; 99.7% CPU Usage
  • Tuesday 09 February, 2021 08:08:15 UTC; 99.3% CPU Usage

[Image: AWS CloudWatch CPU usage screenshot, 2021-02-11 14-14-31]

Because of the privacy policy, I have not logged the exact commands executed before each crash; doing so would require logging every command the bot executes at all times. I am therefore unsure whether a particular command is causing these crashes. If the bot goes offline after running a command or event, please report it here.
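One way to get some diagnostics without logging any commands might be a small in-process CPU watchdog. The sketch below is only an illustration, not something the bot currently does: the names `cpu_fraction`, `format_thread_stacks`, `watch_cpu`, and `start_cpu_watchdog` are hypothetical, and it assumes the CPU is being consumed by the bot's own Python process. It samples the process's CPU time with the standard library and, when usage over a window exceeds a threshold, logs the stack of every thread (code locations only, no message or command contents).

```python
import sys
import threading
import time
import traceback


def cpu_fraction(cpu_start, wall_start, cpu_end, wall_end):
    """Share of one core used by this process over the sampling window."""
    wall = wall_end - wall_start
    if wall <= 0:
        return 0.0
    return (cpu_end - cpu_start) / wall


def format_thread_stacks():
    """Return the current stack of every thread as one string."""
    chunks = []
    for thread_id, frame in sys._current_frames().items():
        chunks.append(f"--- thread {thread_id} ---\n")
        chunks.extend(traceback.format_stack(frame))
    return "".join(chunks)


def watch_cpu(threshold=0.9, interval=30.0, log=print, stop_event=None):
    """Log thread stacks whenever CPU use over `interval` reaches `threshold`."""
    stop_event = stop_event or threading.Event()
    cpu_prev, wall_prev = time.process_time(), time.monotonic()
    # Event.wait returns False on timeout, so the loop runs once per interval
    # until stop_event is set.
    while not stop_event.wait(interval):
        cpu_now, wall_now = time.process_time(), time.monotonic()
        usage = cpu_fraction(cpu_prev, wall_prev, cpu_now, wall_now)
        if usage >= threshold:
            log(f"CPU at {usage:.0%} over last {interval:.0f}s\n{format_thread_stacks()}")
        cpu_prev, wall_prev = cpu_now, wall_now


def start_cpu_watchdog(**kwargs):
    """Run watch_cpu in a daemon thread so it never blocks shutdown."""
    thread = threading.Thread(target=watch_cpu, kwargs=kwargs, daemon=True)
    thread.start()
    return thread
```

Calling `start_cpu_watchdog(log=logging.getLogger("cpu").warning)` once at startup would be one way to wire it in, assuming a logger that writes to disk. Two caveats: the watchdog cannot run if a C extension holds the interpreter lock for the whole window, and it sees nothing if another process on the instance is the one using the CPU.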

kevinjycui avatar Feb 11 '21 19:02 kevinjycui

Downtime has notably increased. Will be attempting to implement sharding to fix this.

EDIT: Running sharded bot on production beta. Will see if this resolves issue.

kevinjycui avatar Mar 23 '21 20:03 kevinjycui

How is the bot process being executed?

My first thought was to add a service monitor so the process gets restarted automatically after a crash, at least to decrease the downtime while the issue is investigated.
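To make the idea concrete, here is a minimal sketch of such a monitor written in Python with only the standard library. It is an illustration under assumptions rather than a recommendation over a real service manager: `supervise` is a hypothetical helper, and the start command passed to it would have to be the bot's actual one. It reruns the command after every non-zero exit, waits between attempts, and gives up after a fixed number of restarts.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=5, delay=5.0, sleep=time.sleep):
    """Run `cmd`, restarting it after each non-zero exit.

    Returns the list of exit codes observed. Stops after a clean exit
    (code 0) or once `max_restarts` restarts have been used.
    """
    exit_codes = []
    while True:
        # subprocess.call blocks until the child exits and returns its code.
        code = subprocess.call(cmd)
        exit_codes.append(code)
        if code == 0 or len(exit_codes) > max_restarts:
            return exit_codes
        sleep(delay)


# Example call (the script name is a placeholder, not the bot's real entry point):
# supervise([sys.executable, "bot.py"])
```

A monitor like this only helps when the bot process itself exits. If the whole instance becomes unresponsive, nothing running on that instance can restart it.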

orendon avatar Apr 28 '21 16:04 orendon

@orendon It's just a Python script that is being run in a tmux window

kevinjycui avatar May 06 '21 13:05 kevinjycui

@kevinjycui Here is a systemd example that I made for another bot https://gist.github.com/orendon/a34d60e6fbe96e5433f60aeb28c9987c

You can also check this post for further details: https://ma.ttias.be/auto-restart-crashed-service-systemd/

orendon avatar May 12 '21 15:05 orendon

That could be a nice work-around. I tried implementing it a few days back but then it crashed again a few days after. Will look into this further.

kevinjycui avatar May 15 '21 20:05 kevinjycui

@kevinjycui Did the systemd approach work? I'm willing to help with this if you consider it appropriate.

orendon avatar Jun 05 '21 15:06 orendon

@orendon It seems not to have worked, since it crashed again a few days ago. It looks like the service file got deleted, so I put it back:

# /etc/systemd/system/practice.service

[Unit]
Description=Practice-bot

[Service]
ExecStart=/home/kevin/Practice-Bot/run.sh
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=default.target

kevinjycui avatar Jun 06 '21 15:06 kevinjycui