TeddyCloud Service Fails to Start Sporadically After System Reboot
Describe the bug
After a system reboot, the TeddyCloud service sporadically fails to start. Although the LXC container and the Docker container both appear to be running, TeddyCloud is not reachable in the browser. In addition, the CPU load shown in Proxmox rises to ~20%. Manually restarting the LXC container resolves the issue.
This issue was only noticed because the system needs to restart regularly due to the scheduled backup process.
To Reproduce
Steps to reproduce the behavior:
1. Reboot the Proxmox host or restart the LXC container.
2. The LXC container starts, and Docker reports that TeddyCloud is running.
3. Attempt to access TeddyCloud via the browser.
4. TeddyCloud is inaccessible.
5. Observe that CPU load in Proxmox increases to ~20%.
6. Restart the LXC container manually.
7. TeddyCloud becomes accessible again.
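To catch the failure right after a reboot without opening a browser, the check in steps 3 and 4 can be scripted. This is only a minimal sketch: the URL is a placeholder for wherever your TeddyCloud web interface is exposed, and classify_http_code and check_reachable are hypothetical helper names.

```shell
#!/bin/sh
# Map the HTTP status reported by curl to a short verdict.
# curl prints 000 when it got no HTTP response at all (refused, timed out, ...).
classify_http_code() {
    case "$1" in
        000|"") echo "unreachable" ;;
        *)      echo "reachable (HTTP $1)" ;;
    esac
}

# Probe a URL with a 5 second limit and print the verdict.
check_reachable() {
    url=$1
    code=$(curl --silent --output /dev/null --max-time 5 \
                --write-out '%{http_code}' "$url") || true
    classify_http_code "$code"
}

# Example (placeholder URL, adjust to your setup):
#   check_reachable "http://teddycloud.example/"
```

Running this from a boot-time job or cron entry would at least record when the service came up unreachable.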
Expected behavior
TeddyCloud should start correctly and be accessible after a system reboot or container restart, without requiring manual intervention.
Screenshots
N/A
Technical Details:
- TeddyCloud Version: v0.6.3
- Proxmox PVE Version: 8.3.3
- Proxmox Backup Server (PBS) Version: 3.3.2
- LXC Container: Docker LXC (Installed via HelperScript)
- Docker Version: 27.4.1
- Docker image used: ghcr.io/toniebox-reverse-engineering/teddycloud:lates
Attach logs of teddyCloud
Additional context
- The issue is not caused by the Proxmox backup process itself.
- The backup is scheduled for 08:00 AM, but logs show a 1-hour offset (logged at 07:00 AM).
- The problem occurs sporadically after a system reboot.
This also happens once in a while on my test system. The reason is unknown: teddyCloud doesn't start at all, and there is no log output in that case.
Since this condition occurs every 2-5 days, I would like to narrow down the error further. I suspect that the teddycloud service is the problem, because Proxmox itself and the Docker LXC start up completely. Portainer in Docker LXC is also accessible. The teddycloud Docker container is also running according to Portainer.
As soon as I log into the shell of the container and start teddycloud manually, the service runs normally. This means that only teddycloud fails to start automatically. Therefore, I don't see any indication of a Proxmox problem.
I need help to narrow down the error further. What debugging options are available in the teddycloud docker container?
As teddycloud produces no log output at all, the hang can only be happening here:
https://github.com/toniebox-reverse-engineering/teddycloud/blob/93b4ecd6a0c546d27c1ba487ede9aa01a9038274/src/main.c#L235
int_t main(int argc, char *argv[])
{
    char cwd[PATH_LEN] = {0};
    error_text_init();
    get_settings()->log.level = TRACE_LEVEL_WARNING;
    TRACE_PRINTF(BUILD_FULL_NAME_LONG "\r\n\r\n");
error_text_init just fills an array with data, and get_settings likewise only writes data to existing structures.
teddycloud itself has nothing built in that would help debug this directly. It may be possible to add a DEBUG ENV variable that enables strace or something similar, to see what the system is doing.
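Without changing the image, one thing that can be checked from a shell inside the container while the hang is happening is the process state and CPU time under /proc. A rough sketch, assuming a Linux /proc filesystem; the function names are made up for this example, and the PID would be the one of the stuck teddycloud process:

```shell
#!/bin/sh
# Print the state letter of a process (R = running, S = sleeping,
# D = uninterruptible sleep, Z = zombie, ...) from /proc/<pid>/status.
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status"
}

# Print cumulative user + system CPU time in clock ticks from /proc/<pid>/stat.
# Everything up to the last ") " is stripped first so that a process name
# containing spaces cannot shift the fields; after that, utime and stime
# are fields 12 and 13.
proc_cpu_ticks() {
    sed 's/^.*) //' "/proc/$1/stat" | awk '{ print $12 + $13 }'
}

# Print how many CPU ticks the process consumed during an interval
# (second argument, in seconds, default 1).
cpu_ticks_delta() {
    before=$(proc_cpu_ticks "$1")
    sleep "${2:-1}"
    after=$(proc_cpu_ticks "$1")
    echo $((after - before))
}

# Example: cpu_ticks_delta "$(pgrep -o -x teddycloud)" 5
```

A process that stays in state R with a steadily growing tick count is most likely spinning in a loop rather than waiting on something, which would narrow down where to look.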
I have modified the docker-entrypoint.sh of the teddycloud Docker container. Maybe this will give me more information?
#!/bin/bash
set -o errexit
set -o pipefail
set -o nounset
set -o xtrace # Enable command tracing for debugging
mkdir -p /teddycloud/certs/server /teddycloud/certs/client
if [ -n "${DOCKER_TEST:-}" ]; then
    cd /teddycloud
    LSAN_OPTIONS=detect_leaks=0 teddycloud --docker-test
else
    while true
    do
        cd /teddycloud
        echo "Starting teddycloud at $(date)"
        teddycloud
        retVal=$?
        echo "teddycloud exited with code $retVal at $(date)"
        # Optional: Add a short delay before restarting
        sleep 5
        if [ $retVal -ne -2 ]; then
            exit $retVal
        fi
    done
fi
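A side note on the loop in this script: with errexit set, a non-zero exit from teddycloud ends the script at that line, so retVal=$? and the exit-code echo are never reached in the failure case. In addition, the shell only sees exit statuses in the range 0-255, so a process returning -2 shows up as 254 and the -ne -2 comparison is always true. Below is a minimal sketch of a variant that avoids both. It assumes that -2 (seen as 254) is meant as the "restart requested" code, as the comparison suggests; run_until_final_exit and RESTART_DELAY are hypothetical names and not part of teddycloud.

```shell
#!/bin/sh
# Run a command in a loop. Restart it only while it exits with the given
# restart code; return any other exit code to the caller.
run_until_final_exit() {
    restart_code=$1
    shift
    while true; do
        ret=0
        # The '|| ret=$?' form captures the status without tripping errexit.
        "$@" || ret=$?
        echo "command exited with code $ret at $(date)" >&2
        if [ "$ret" -ne "$restart_code" ]; then
            return "$ret"
        fi
        # Short pause before restarting (seconds, default 5).
        sleep "${RESTART_DELAY:-5}"
    done
}

# Example: an exit(-2) from the process is reported to the shell as 254.
#   cd /teddycloud && run_until_final_exit 254 teddycloud
```

This would not explain the hang itself, since the trace output shows teddycloud never returning, but it makes sure the exit code gets logged if it ever does.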
I have added this to the latest develop. Additionally, setting environment: STRACE: 1 enables strace. It is also possible to attach strace to a running process (without the env variable set) if this happens again with this build.
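For attaching to an already running process, something along these lines should work from a shell inside the container. It is a sketch rather than a tested recipe: attach_strace, DRY_RUN and the default output path are made up for this example, and it assumes the process is named teddycloud. Depending on the Docker and kernel configuration, the container may need to be started with --cap-add=SYS_PTRACE for the attach to be permitted.

```shell
#!/bin/sh
# Attach strace to a running PID and write the trace to a file.
#   -f   also trace threads and child processes
#   -tt  prefix each line with a timestamp including microseconds
#   -p   attach to an existing process
#   -o   write the trace to a file instead of stderr
# With DRY_RUN set, only print the command that would be run.
attach_strace() {
    pid=$1
    out=${2:-/tmp/teddycloud.strace}
    if [ -n "${DRY_RUN:-}" ]; then
        echo "strace -f -tt -p $pid -o $out"
    else
        strace -f -tt -p "$pid" -o "$out"
    fi
}

# Example (stop with Ctrl-C after a few seconds):
#   attach_strace "$(pgrep -o -x teddycloud)"
```

If the process is spinning, a few seconds of trace are usually enough to see whether it repeats the same system call or makes none at all.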
This is all the log output I got, nothing more (v0.6.3):
2025-03-17T07:01:11.510674000Z + mkdir -p /teddycloud/certs/server /teddycloud/certs/client
2025-03-17T07:01:11.512045000Z + '[' -n '' ']'
2025-03-17T07:01:11.512158000Z + true
2025-03-17T07:01:11.512223000Z + cd /teddycloud
2025-03-17T07:01:11.512390000Z ++ date
2025-03-17T07:01:11.513250000Z Starting teddycloud at Mon Mar 17 07:01:11 UTC 2025
2025-03-17T07:01:11.513326000Z + echo 'Starting teddycloud at Mon Mar 17 07:01:11 UTC 2025'
2025-03-17T07:01:11.513404000Z + teddycloud
Can you also enable strace?
I have also been getting this error after rebooting since v0.6.3, but on a regular Arch Linux system with Docker. The error still occurs with v0.6.4. There are no logs, and one thread runs at almost 100% CPU.
How do I enable strace?
You need to install strace in the container and then change the teddycloud call in your script to: LSAN_OPTIONS=detect_leaks=0 strace teddycloud --docker-test
strace generates a lot of output. During a normal run of teddycloud, this might impact performance considerably.
You don't need to install strace; it has been available within the container since 0.6.4. You can also enable strace with the STRACE: 1 ENV variable.
JFYI: I have not experienced this issue in a while.