AIO Linode becomes unresponsive after a few days

Open MariusQuabeck opened this issue 1 year ago • 18 comments

Every ~10 days, my Nextcloud instance becomes unresponsive:

  • The web interface won't load
  • Calendar sync stops working
  • Even over SSH, login hangs and times out
  • The AIO login prompt on port 8443 works, but the page never finishes loading after submitting the password

I literally can't access the VM or its logs without rebooting it from Linode's web interface, so I'm not sure how or where to get logs for this. A forced reboot from Linode's web interface solves the issue for at least a few days.
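Since the machine only becomes reachable again after a reboot, one place to look is the journal of the previous boot. The following is a minimal sketch, assuming the systemd journal is stored persistently (that is, /var/log/journal exists, which may not hold on every image). The function name previous_boot_log is made up for illustration.

```shell
# Hypothetical helper: show what the previous boot logged shortly before the hang.
# Assumes a persistent systemd journal; otherwise only the current boot is available.
previous_boot_log() {
  # List the boots the journal knows about (0 is the current boot, -1 the one before).
  sudo journalctl --list-boots --no-pager
  # Last 200 entries of priority "warning" or worse from the previous boot.
  sudo journalctl -b -1 -p warning -n 200 --no-pager
  # Last 100 kernel messages from the previous boot (OOM killer, hung tasks, ...).
  sudo journalctl -k -b -1 -n 100 --no-pager
}

# Usage after a forced reboot:
#   previous_boot_log | less
```

If the previous boot shows nothing unusual right up to the hang, that in itself suggests the system stalled too hard to write logs.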

This is an AIO install from Linode's marketplace/store, with no custom changes AFAIK.

Here is what the Linode analytics look like while the issue occurs: [image]

In this instance, AIO had been unresponsive for almost 3 days before I noticed: [image]

MariusQuabeck avatar Feb 05 '24 17:02 MariusQuabeck

Hi Marius, hope you are doing fine? 👋

Can you please post the output of sudo docker info here? :)

szaimen avatar Feb 05 '24 20:02 szaimen

I'm doing great, thank you. I hope we'll run into each other at some event soon

sudo docker info

Client: Docker Engine - Community
 Version:    24.0.7
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.11.2
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.21.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 11
  Running: 9
  Paused: 0
  Stopped: 2
 Images: 12
 Server Version: 24.0.7
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 3dd1e886e55dd695541fdcd67420c2888645a495
 runc version: v1.1.10-0-g18a0cb0
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.15.0-92-generic
 Operating System: Ubuntu 22.04.3 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 3.82GiB
 Name: REDACTED.ip.linodeusercontent.com
 ID: REDACTED
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

MariusQuabeck avatar Feb 06 '24 15:02 MariusQuabeck

I'm doing great, thank you. I hope we'll run into each other at some event soon

Great to hear! Hope so as well :)

szaimen avatar Feb 07 '24 11:02 szaimen

Regarding your issue, can you please update docker via sudo apt update; sudo apt upgrade; sudo reboot and check if that improves things? I fear the issue is caused by docker and not by AIO 😢

szaimen avatar Feb 07 '24 11:02 szaimen

@MariusQuabeck @szaimen

I have been having the same issue for the last few months (Nextcloud non-responsive after approx. 13 days uptime).

I have documented my torturous journey here on the Nextcloud help forum.

I have had no support from the Nextcloud team, nor any recognition that this issue affects others apart from you, but I was sure I wasn't the only one experiencing it. So thanks for sharing @MariusQuabeck!

The issue seemed to start around AIO v7.9.1 and continues to the present day on AIO v7.11.2, with a totally different setup from yours. I'm running on Windows 11, with a Hyper-V VM running Ubuntu 22.04.3 LTS and the datadir pointing to a network location on the Windows host.

My investigation points to root/ncadmin processes stuck in a loop on network operations, trying to communicate over SMB and using 2-4 cores intensely. This stops the Nextcloud master container and Ubuntu from communicating, both internally on the private network and externally via the exposed hostname. The workaround for me was to open an SSH connection right after the VM starts and keep it open with htop running. @MariusQuabeck Are you using an SMB share as datadir like me?
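The observation above (a few processes saturating several cores) can be checked from a shell that is still responsive. This is a minimal sketch, assuming procps ps and awk are available. The helper name busy_only and the 50% threshold are illustrative, not part of AIO.

```shell
# Hypothetical filter: keep the header line plus every row whose %CPU
# (3rd column) is at or above the given threshold (default 50).
busy_only() {
  awk -v min="${1:-50}" 'NR == 1 || $3 + 0 >= min'
}

# Usage on a live system, sorted by CPU usage (procps ps assumed):
#   ps -eo pid,user,pcpu,stat,comm --sort=-pcpu | busy_only 50
```

The STAT column is included in the usage line because it shows whether a busy process is running (R) or blocked in uninterruptible I/O (D), which is a common symptom of a stalled network filesystem.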

Note that I set everything up as per the AIO documentation.

@szaimen I tried sudo apt update; sudo apt upgrade; sudo reboot the last time AIO became unresponsive, on 24th January, and I have experienced this with both Ubuntu 22.04.1 LTS and Ubuntu 22.04.3 LTS. There was no improvement or change: AIO still hangs after approx. 13 days. I suspect this may not be a Docker issue per se, but an issue with network communication when using SMB as the datadir. Please have a look at my post here for further detail.

@szaimen Here's my sudo docker info:

Client: Docker Engine - Community
 Version:    25.0.0
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.12.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.24.1
    Path:     /usr/libexec/docker/cli-plugins/docker-compose
  scan: Docker Scan (Docker Inc.)
    Version:  v0.23.0
    Path:     /usr/libexec/docker/cli-plugins/docker-scan

Server:
 Containers: 17
  Running: 14
  Paused: 0
  Stopped: 3
 Images: 19
 Server Version: 25.0.0
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: a1496014c916f9e62104b33d1bb5bd03b0858e59
 runc version: v1.1.11-0-g4bccb38
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.15.0-92-generic
 Operating System: Ubuntu 22.04.3 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 10.42GiB
 Name: nextcloud
 ID: 934c1cb8-3402-4ccf-ae66-59d51c9cde6c
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

sunnyd24 avatar Feb 07 '24 14:02 sunnyd24

I've installed updates and will report back if this issue keeps occurring. Thanks Simon!

sudo apt list --upgradable 
Listing... Done
base-files/jammy-updates 12ubuntu4.5 amd64 [upgradable from: 12ubuntu4.4]
containerd.io/jammy 1.6.28-1 amd64 [upgradable from: 1.6.26-1]
distro-info-data/jammy-updates 0.52ubuntu0.6 all [upgradable from: 0.52ubuntu0.5]
distro-info/jammy-updates 1.1ubuntu0.2 amd64 [upgradable from: 1.1ubuntu0.1]
docker-buildx-plugin/jammy 0.12.1-1~ubuntu.22.04~jammy amd64 [upgradable from: 0.11.2-1~ubuntu.22.04~jammy]
docker-ce-cli/jammy 5:25.0.3-1~ubuntu.22.04~jammy amd64 [upgradable from: 5:24.0.7-1~ubuntu.22.04~jammy]
docker-ce-rootless-extras/jammy 5:25.0.3-1~ubuntu.22.04~jammy amd64 [upgradable from: 5:24.0.7-1~ubuntu.22.04~jammy]
docker-ce/jammy 5:25.0.3-1~ubuntu.22.04~jammy amd64 [upgradable from: 5:24.0.7-1~ubuntu.22.04~jammy]
docker-compose-plugin/jammy 2.24.5-1~ubuntu.22.04~jammy amd64 [upgradable from: 2.21.0-1~ubuntu.22.04~jammy]
libmm-glib0/jammy-updates 1.20.0-1~ubuntu22.04.3 amd64 [upgradable from: 1.20.0-1~ubuntu22.04.2]
libnss-systemd/jammy-updates 249.11-0ubuntu3.12 amd64 [upgradable from: 249.11-0ubuntu3.11]
libpam-systemd/jammy-updates 249.11-0ubuntu3.12 amd64 [upgradable from: 249.11-0ubuntu3.11]
libsystemd0/jammy-updates 249.11-0ubuntu3.12 amd64 [upgradable from: 249.11-0ubuntu3.11]
libudev1/jammy-updates 249.11-0ubuntu3.12 amd64 [upgradable from: 249.11-0ubuntu3.11]
linux-firmware/jammy-updates 20220329.git681281e4-0ubuntu3.26 all [upgradable from: 20220329.git681281e4-0ubuntu3.23]
modemmanager/jammy-updates 1.20.0-1~ubuntu22.04.3 amd64 [upgradable from: 1.20.0-1~ubuntu22.04.2]
motd-news-config/jammy-updates 12ubuntu4.5 all [upgradable from: 12ubuntu4.4]
python3-distro-info/jammy-updates 1.1ubuntu0.2 all [upgradable from: 1.1ubuntu0.1]
python3-software-properties/jammy-updates 0.99.22.9 all [upgradable from: 0.99.22.8]
python3-update-manager/jammy-updates 1:22.04.18 all [upgradable from: 1:22.04.10]
software-properties-common/jammy-updates 0.99.22.9 all [upgradable from: 0.99.22.8]
systemd-hwe-hwdb/jammy-updates 249.11.5 all [upgradable from: 249.11.4]
systemd-sysv/jammy-updates 249.11-0ubuntu3.12 amd64 [upgradable from: 249.11-0ubuntu3.11]
systemd-timesyncd/jammy-updates 249.11-0ubuntu3.12 amd64 [upgradable from: 249.11-0ubuntu3.11]
systemd/jammy-updates 249.11-0ubuntu3.12 amd64 [upgradable from: 249.11-0ubuntu3.11]
tzdata/jammy-updates 2023d-0ubuntu0.22.04 all [upgradable from: 2023c-0ubuntu0.22.04.2]
udev/jammy-updates 249.11-0ubuntu3.12 amd64 [upgradable from: 249.11-0ubuntu3.11]
update-manager-core/jammy-updates 1:22.04.18 all [upgradable from: 1:22.04.10]

MariusQuabeck avatar Feb 07 '24 14:02 MariusQuabeck

See https://help.nextcloud.com/t/nextcloud-aio-v7-11-2-non-responsive-high-cpu-every-13-days-network-locationon-host-for-ncdata/178768/7 for my analysis on the topic

szaimen avatar Feb 08 '24 21:02 szaimen

@Zoey2936 since docker-init seems to be involved, do you think that setting init: back to false might already fix this?

szaimen avatar Feb 15 '24 17:02 szaimen

I think not, but not sure

Zoey2936 avatar Feb 16 '24 09:02 Zoey2936

We could at least try if it helps...

szaimen avatar Feb 16 '24 10:02 szaimen

I found this: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1890913 so possibly it is in the end a systemd problem...

szaimen avatar Feb 16 '24 11:02 szaimen

If my PR does not help, I might disable /lib/systemd/system/apport-autoreport.service for a test...

szaimen avatar Feb 16 '24 11:02 szaimen

Not sure if it helps, but looking back at the load analytics, it started happening in August 2023 and has been happening consistently twice a month since.

MariusQuabeck avatar Feb 16 '24 11:02 MariusQuabeck

So it just happened for you again @MariusQuabeck ?

szaimen avatar Feb 16 '24 11:02 szaimen

Sorry, I should have been clearer: no, it has not happened again yet. I will report back in a month.

MariusQuabeck avatar Feb 16 '24 17:02 MariusQuabeck

It just happened again

MariusQuabeck avatar Feb 19 '24 18:02 MariusQuabeck

Hi @MariusQuabeck can you run sudo systemctl disable --now apport-autoreport.service && sudo reboot and check if that helps?

szaimen avatar Feb 20 '24 14:02 szaimen

Done, I'll report back in another month or sooner :)

MariusQuabeck avatar Feb 22 '24 11:02 MariusQuabeck

If it should happen again, please add the following line to your root crontab via sudo crontab -e: 0 2 * * 7 reboot. This will reboot your server once a week on Sundays at 02:00. Make sure that this does not conflict with the daily backup time! Unfortunately I don't see any way we can fix this in AIO, as it is caused by the OS and not by AIO.
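For reference, the five crontab fields are minute, hour, day of month, month and day of week, and in the day-of-week field both 0 and 7 mean Sunday. The sketch below only decodes the suggested entry and does not install or run anything. The helper name cron_dow_name is made up for illustration.

```shell
# Hypothetical helper: map a cron day-of-week number to its name.
# Both 0 and 7 stand for Sunday in crontab syntax.
cron_dow_name() {
  case "$1" in
    0|7) echo Sunday ;;
    1) echo Monday ;;
    2) echo Tuesday ;;
    3) echo Wednesday ;;
    4) echo Thursday ;;
    5) echo Friday ;;
    6) echo Saturday ;;
    *) echo "invalid day-of-week: $1" >&2; return 1 ;;
  esac
}

# Fields of the suggested entry: minute hour day-of-month month day-of-week command
entry='0 2 * * 7 reboot'
set -f            # keep the literal * fields from being glob-expanded
set -- $entry     # split the entry into positional parameters
set +f
printf 'runs at %02d:%02d every %s\n' "$2" "$1" "$(cron_dow_name "$5")"
# prints: runs at 02:00 every Sunday
```

To reboot on Saturdays instead, the fifth field would be 6.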

szaimen avatar Mar 04 '24 11:03 szaimen