
Portainer agent down after 2.20.1 upgrade

Open v1rusnl opened this issue 10 months ago • 11 comments


Problem Description

2 Devices

  • Openmediavault with the main Portainer instance (Debian 11; 192.168.2.54)
  • Armbian device with Portainer Agent (Armbian 24.2.1 (Debian 12); 192.168.2.52)

After I upgraded both instances to 2.20.1, the portainer-agent environment is shown as "down" in the WebUI. When I click on the environment, Portainer takes much longer than usual to load, but eventually opens the agent environment. The environment overview then shows it as "up". After leaving the page for several minutes and coming back to it, the agent is "down" again.


No settings were changed on either system, just an update of Portainer and the Agent to sts (2.20.1).

Expected Behavior

Open the Portainer WebUI and see the agent environment status as "up".

Actual Behavior

Open the Portainer WebUI and the agent environment status is "down".

Steps to Reproduce

  1. Open the Portainer WebUI -> agent environment is down
  2. Click on the agent environment -> load time is much longer than usual -> agent environment appears as up
  3. Leave the WebUI for, let's say, half an hour -> agent environment is down again

Portainer logs or screenshots

Portainer Main:

2024/04/06 10:22PM INF portainer/main.go:823 > starting Portainer | build_number=61 go_version=go1.21.6 image_tag=linux-amd64-2.20.1 nodejs_version=18.20.0 version=2.20.1 webpack_version=5.88.2 yarn_version=1.22.22
2024/04/06 10:22PM INF http/server.go:463 > starting HTTPS server | bind_address=:9443
2024/04/06 10:22PM INF http/server.go:447 > starting HTTP server | bind_address=:9000
...
2024/04/06 10:47PM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/06 10:57PM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/06 11:07PM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/06 11:17PM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/06 11:37PM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/06 11:47PM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/06 11:57PM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 12:07AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 12:27AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 12:42AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 01:02AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 01:12AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 01:27AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 06:32AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 06:42AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 07:02AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 07:12AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 07:22AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 07:32AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 07:42AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 07:52AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 08:07AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 08:27AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 08:37AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 08:47AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 08:57AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 09:07AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 09:17AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 09:32AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 09:47AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 10:02AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 10:12AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 10:22AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 10:37AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 10:52AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 11:02AM ERR snapshot/snapshot.go:313 > background schedule error (environment snapshot), unable to execute pending actions | error="environment \"labor\" (id: 3) is not up"
2024/04/07 11:13AM WRN docker/snapshot.go:93 > unable to snapshot engine version | error="Cannot connect to the Docker daemon at tcp://192.168.2.52:9001. Is the docker daemon running?" environment=labor

Agent environment:

2024/04/07 09:28AM INF ./main.go:86 > agent running on Docker platform
2024/04/07 09:28AM FTL ./main.go:92 > unable to retrieve information from Docker | error="Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
2024/04/07 09:28AM INF ./main.go:86 > agent running on Docker platform
2024/04/07 09:28AM FTL ./main.go:92 > unable to retrieve information from Docker | error="Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
2024/04/07 09:29AM INF ./main.go:86 > agent running on Docker platform
2024/04/07 09:29AM FTL ./main.go:92 > unable to retrieve information from Docker | error="Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
2024/04/07 09:29AM INF ./main.go:86 > agent running on Docker platform
2024/04/07 09:29AM FTL ./main.go:92 > unable to retrieve information from Docker | error="Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
2024/04/07 09:29AM INF ./main.go:86 > agent running on Docker platform
2024/04/07 09:29AM FTL ./main.go:92 > unable to retrieve information from Docker | error="Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
2024/04/07 09:33AM INF ./main.go:86 > agent running on Docker platform
2024/04/07 09:33AM INF registry/server.go:101 > starting registry credential server
2024/04/07 09:33AM INF http/server.go:99 > starting Agent API server | api_version=2.20.1 server_addr=0.0.0.0 server_port=9001 use_tls=true
2024/04/07 09:37:11 http: TLS handshake error from 192.168.2.54:56224: EOF
2024/04/07 09:37:11 http: TLS handshake error from 192.168.2.54:56190: EOF
2024/04/07 09:37:11 http: TLS handshake error from 192.168.2.54:56218: EOF
2024/04/07 09:38:33 http: TLS handshake error from 192.168.2.54:58672: EOF
2024/04/07 09:39:59 http: TLS handshake error from 192.168.2.54:58282: EOF
2024/04/07 09:39:59 http: TLS handshake error from 192.168.2.54:58264: EOF
2024/04/07 09:40:27 http: TLS handshake error from 192.168.2.54:33830: EOF
2024/04/07 09:40:27 http: TLS handshake error from 192.168.2.54:33840: EOF
2024/04/07 09:42:14 http: TLS handshake error from 192.168.2.54:45652: EOF
2024/04/07 10:02:20 http: TLS handshake error from 192.168.2.54:49548: EOF
2024/04/07 10:02:20 http: TLS handshake error from 192.168.2.54:49554: EOF
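
The agent log shows two distinct failures: at first the agent cannot reach the local Docker socket at all, and later the Portainer server's TLS connections to port 9001 are dropped mid-handshake. A minimal way to check both paths by hand is sketched below, assuming the container name and IP addresses from this report; nc and openssl are only used here to probe reachability and the TLS handshake:

# On the agent host (192.168.2.52): confirm the Docker socket exists and the daemon answers
ls -l /var/run/docker.sock
docker info --format '{{.ServerVersion}}'

# Confirm the agent container is running and look at its most recent output
docker ps --filter name=portainer_agent
docker logs --tail 20 portainer_agent

# From the Portainer host (192.168.2.54): probe TCP reachability and the TLS handshake on the agent port
nc -zv 192.168.2.52 9001
openssl s_client -connect 192.168.2.52:9001 </dev/null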

Docker version on both devices:

Client: Docker Engine - Community
 Version:           26.0.0
 API version:       1.45
 Go version:        go1.21.8
 Git commit:        2ae903e
 Built:             Wed Mar 20 15:18:25 2024
 OS/Arch:           linux/arm
 Context:           default

Server: Docker Engine - Community
 Engine:
  Version:          26.0.0
  API version:      1.45 (minimum version 1.24)
  Go version:       go1.21.8
  Git commit:       8b79278
  Built:            Wed Mar 20 15:18:25 2024
  OS/Arch:          linux/arm
  Experimental:     false
 containerd:
  Version:          1.6.28
  GitCommit:        ae07eda36dd25f8a1b98dfbf587313b99c0190bb
 runc:
  Version:          1.1.12
  GitCommit:        v1.1.12-0-g51d5e94
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Portainer version

2.20.1

Portainer Edition

Business Edition (BE/EE) with 5NF / 3NF license

Platform and Version

Docker 26.0.0 | see the log section above

OS and Architecture

Main Portainer instance (Debian 11; AMD64) - Portainer Agent (Armbian 24.2.1 (Debian 12); ARM/v7)

Browser

Vivaldi 6.6.3271.57

What command did you use to deploy Portainer?

Portainer Main:

services:
  portainer:
    image: portainer/portainer-ee:sts
    container_name: portainer
    restart: unless-stopped
    security_opt:
      - no-new-privileges:true
    volumes:
      - /etc/localtime:/etc/localtime:ro
      - /var/run/docker.sock:/var/run/docker.sock
      - /opt/tools/portainer/data:/data
    ports:
      - 9000:9000
      - 9443:9443


Portainer Agent:

docker run -d \
  -p 9001:9001 \
  --name portainer_agent \
  --restart=always \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /var/lib/docker/volumes:/var/lib/docker/volumes \
  portainer/agent:sts
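
Likewise, a sketch of recreating the agent with a pinned version instead of the floating sts tag (container name and mounts copied from the command above; 2.20.1 is the version discussed in this issue):

# remove the old agent container and start a pinned one with the same mounts
docker stop portainer_agent && docker rm portainer_agent
docker run -d \
  -p 9001:9001 \
  --name portainer_agent \
  --restart=always \
  -v /var/run/docker.sock:/var/run/docker.sock \
  -v /var/lib/docker/volumes:/var/lib/docker/volumes \
  portainer/agent:2.20.1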

Additional Information

No response

v1rusnl avatar Apr 07 '24 10:04 v1rusnl

Having the same issue and moved back to the previous version. :(

tobidemski avatar Apr 08 '24 22:04 tobidemski

Having the same issue and moved back to the previous version. :(

What previous version did you move back to?

jamescarppe avatar Apr 09 '24 03:04 jamescarppe

Rolling back the Agent did not solve it for me, unfortunately.

v1rusnl avatar Apr 09 '24 04:04 v1rusnl

I rolled back the Portainer instance from 2.20.1 to 2.19.4. I don't use remote agents; I connect via the socket instead, and had the same issues.

tobidemski avatar Apr 09 '24 17:04 tobidemski

I rolled back the Portainer instance from 2.20.1 to 2.19.4. I don't use remote agents; I connect via the socket instead, and had the same issues.

Don't know if there were changes under the hood, but a simple rollback to 2.19.4 on the main instance also did not fix the issue for me. Posting here was my last shot.

v1rusnl avatar Apr 09 '24 17:04 v1rusnl

We're not seeing this with most users, so it's likely there's something unique about the setups of those of you who are experiencing it. Could you give us a bit more detail about your environments so we can see if there are common factors at play? Things like operating system and version, storage configuration (local storage vs NFS or similar), OS security measures (for example AppArmor / SELinux), etc. would be helpful. The more info we can get, the more likely we'll be able to figure out a cause.
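
For anyone replying, one way to gather the details requested above is sketched here (standard commands; sestatus and aa-status are only present where SELinux or AppArmor is installed):

cat /etc/os-release
uname -m
docker version
docker info --format 'Storage driver: {{.Driver}} | Security options: {{.SecurityOptions}}'
sestatus 2>/dev/null || aa-status 2>/dev/null
df -hT /var/lib/docker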

jamescarppe avatar Apr 11 '24 23:04 jamescarppe

Hello,

I'm also having this issue.

  • OS: Ubuntu 22.04.4 LTS (arm64)
  • Agent: v2.19.4
  • Docker: 25.0.5
  • Portainer: BE 2.19.4

My stderr logs are as follows:

2024/04/11 02:55PM INF github.com/portainer/agent/http/server.go:99 > starting Agent API server | api_version=2.19.4 server_addr=0.0.0.0 server_port=9001 use_tls=true
2024/04/11 15:06:08 http: TLS handshake error from 172.16.16.6:51012: local error: tls: bad record MAC
2024/04/11 15:06:24 http: TLS handshake error from 172.16.16.6:58774: EOF
2024/04/11 15:06:24 http: TLS handshake error from 172.16.16.6:58786: EOF
2024/04/11 15:06:36 http: TLS handshake error from 172.16.16.6:35838: EOF
2024/04/11 15:06:42 http: TLS handshake error from 172.16.16.6:40822: EOF
2024/04/11 15:06:46 http: TLS handshake error from 172.16.16.6:40868: EOF
2024/04/11 15:07:21 http: TLS handshake error from 172.16.16.6:58656: EOF
2024/04/11 15:07:21 http: TLS handshake error from 172.16.16.6:58652: EOF
2024/04/11 15:07:21 http: TLS handshake error from 172.16.16.6:58676: EOF
2024/04/11 15:07:24 http: TLS handshake error from 172.16.16.6:58698: EOF
2024/04/11 15:09:37 http: TLS handshake error from 172.16.16.6:38264: EOF
2024/04/11 15:09:52 http: TLS handshake error from 172.16.16.6:50628: EOF
2024/04/11 15:09:59 http: TLS handshake error from 172.16.16.6:34136: EOF
2024/04/11 15:09:59 http: TLS handshake error from 172.16.16.6:34134: EOF
2024/04/11 15:10:11 http: TLS handshake error from 172.16.16.6:40596: EOF
2024/04/11 15:10:11 http: TLS handshake error from 172.16.16.6:40598: EOF
2024/04/11 15:10:55 http: TLS handshake error from 172.16.16.6:55830: EOF
2024/04/11 15:11:05 http: TLS handshake error from 172.16.16.6:54146: EOF
2024/04/11 15:11:05 http: TLS handshake error from 172.16.16.6:54162: EOF
2024/04/11 15:11:05 http: TLS handshake error from 172.16.16.6:54172: EOF
2024/04/11 15:11:05 http: TLS handshake error from 172.16.16.6:54168: read tcp 172.17.0.2:9001->172.16.16.6:54168: read: connection reset by peer
2024/04/11 15:11:07 http: TLS handshake error from 172.16.16.6:54174: EOF
WARNING: open /host/sys/devices/system/node/node0/cpu0/cache/index1/size: no such file or directory
WARNING: open /host/sys/devices/system/node/node0/cpu0/cache/index1/size: no such file or directory
WARNING: open /host/sys/devices/system/node/node0/cpu0/cache/index2/size: no such file or directory
WARNING: failed to determine memory area for node: open /host/sys/devices/system/node/node0/memory_failure/state: no such file or directory
2024/04/11 15:11:14 http: TLS handshake error from 172.16.16.6:41600: EOF
2024/04/11 15:11:14 http: TLS handshake error from 172.16.16.6:41610: EOF
2024/04/11 15:11:14 http: TLS handshake error from 172.16.16.6:41614: EOF
WARNING: open /host/sys/devices/system/node/node0/cpu0/cache/index1/size: no such file or directory
WARNING: open /host/sys/devices/system/node/node0/cpu0/cache/index1/size: no such file or directory
WARNING: open /host/sys/devices/system/node/node0/cpu0/cache/index2/size: no such file or directory
WARNING: failed to determine memory area for node: open /host/sys/devices/system/node/node0/memory_failure/state: no such file or directory
2024/04/11 15:11:29 http: TLS handshake error from 172.16.16.6:37482: EOF
WARNING: open /host/sys/devices/system/node/node0/cpu0/cache/index1/size: no such file or directory
WARNING: open /host/sys/devices/system/node/node0/cpu0/cache/index1/size: no such file or directory
WARNING: open /host/sys/devices/system/node/node0/cpu0/cache/index2/size: no such file or directory
WARNING: failed to determine memory area for node: open /host/sys/devices/system/node/node0/memory_failure/state: no such file or directory
2024/04/11 15:11:56 http: TLS handshake error from 172.16.16.6:41524: EOF
2024/04/11 15:11:56 http: TLS handshake error from 172.16.16.6:41534: EOF
2024/04/11 15:11:56 http: TLS handshake error from 172.16.16.6:41548: EOF
2024/04/11 15:11:56 http: TLS handshake error from 172.16.16.6:41550: EOF
2024/04/11 15:32:18 http: TLS handshake error from 172.16.16.6:44942: EOF
2024/04/11 15:32:18 http: TLS handshake error from 172.16.16.6:44956: EOF
2024/04/11 15:32:18 http: TLS handshake error from 172.16.16.6:44968: EOF
2024/04/11 15:32:27 http: TLS handshake error from 172.16.16.6:50406: EOF
WARNING: open /host/sys/devices/system/node/node0/cpu0/cache/index1/size: no such file or directory
WARNING: open /host/sys/devices/system/node/node0/cpu0/cache/index1/size: no such file or directory
WARNING: open /host/sys/devices/system/node/node0/cpu0/cache/index2/size: no such file or directory
WARNING: failed to determine memory area for node: open /host/sys/devices/system/node/node0/memory_failure/state: no such file or directory
2024/04/11 15:32:36 http: TLS handshake error from 172.16.16.6:36466: EOF
2024/04/11 15:32:36 http: TLS handshake error from 172.16.16.6:36474: EOF
2024/04/11 15:32:36 http: TLS handshake error from 172.16.16.6:36484: EOF
2024/04/11 15:32:45 http: TLS handshake error from 172.16.16.6:60244: EOF
2024/04/11 15:32:45 http: TLS handshake error from 172.16.16.6:60250: EOF

stefanbulof avatar Apr 12 '24 13:04 stefanbulof

@jamescarppe I'm having the same problem as above.

Upgrading from ":latest" to ":sts" broke Portainer's connection to remote servers.

They appear as "Down". Well, they seem to be "Up" now, but two of the servers are getting an error of "Failure - Unable to retrieve volumes". And not all the containers are appearing. After some time, they're all "Down" again...

One of the servers is CentOS 7 and the other one is AlmaLinux 8. All of the servers (whether working or not) run Docker 24.0.7.

Happens with both 2.20.1 and 2.20.0.

I tried to downgrade back to ":latest", but it looks like a downgrade is not possible:

The database schema version does not align with the server version. Please consider reverting to the previous server version or addressing the database migration issue.

(OK - I was able to downgrade by downloading a backup file, downgrading to ":latest", and uploading the backup file.)
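
For context, a precaution that makes this kind of rollback simpler is to copy the Portainer data volume before upgrading, so the pre-migration database can just be restored. The sketch below is not the procedure described above; it uses the bind-mount path from the compose file earlier in this thread and assumes that deployment style:

# stop Portainer, copy its data directory, then start it again before upgrading
docker compose stop portainer
cp -a /opt/tools/portainer/data /opt/tools/portainer/data.bak-pre-upgrade
docker compose up -d portainer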

ghnp5 avatar Apr 13 '24 22:04 ghnp5

I can confirm it; I cannot go back to the portainer-ee latest/LTS version after having used portainer-ee:2.20.1.

The database schema version does not align with the server version. Please consider reverting to the previous server version or addressing the database migration issue.

Feriman22 avatar Apr 25 '24 10:04 Feriman22

Same issue ...

mikekuzak avatar Apr 30 '24 14:04 mikekuzak

Same issue here. The Portainer Agent container is up and running, but Portainer is unable to connect to it. "Environment is unreachable"

Portainer Agent container log:


2024/05/10 07:39PM INF ./main.go:86 > agent running on Docker platform 
2024/05/10 07:39PM WRN ./main.go:112 > unable to retrieve agent container IP address, using host flag instead | error="Error response from daemon: No such container: bb03d4935cd2" host_flag=0.0.0.0 
2024/05/10 07:39PM INF registry/server.go:101 > starting registry credential server 
2024/05/10 07:39PM INF http/server.go:99 > starting Agent API server | api_version=2.19.5 server_addr=0.0.0.0 server_port=9001 use_tls=true 
2024/05/10 07:40PM INF ./main.go:86 > agent running on Docker platform 
2024/05/10 07:40PM WRN ./main.go:112 > unable to retrieve agent container IP address, using host flag instead | error="Error response from daemon: No such container: bb03d4935cd2" host_flag=0.0.0.0 
2024/05/10 07:40PM INF registry/server.go:101 > starting registry credential server 
2024/05/10 07:40PM INF http/server.go:99 > starting Agent API server | api_version=2.19.5 server_addr=0.0.0.0 server_port=9001 use_tls=true 
2024/05/10 07:41PM INF ./main.go:86 > agent running on Docker platform 
2024/05/10 07:41PM WRN ./main.go:112 > unable to retrieve agent container IP address, using host flag instead | error="Error response from daemon: No such container: bb03d4935cd2" host_flag=0.0.0.0 
2024/05/10 07:41PM INF registry/server.go:101 > starting registry credential server 
2024/05/10 07:41PM INF http/server.go:99 > starting Agent API server | api_version=2.19.5 server_addr=0.0.0.0 server_port=9001 use_tls=true 

Matthias-vdE avatar May 10 '24 19:05 Matthias-vdE

After upgrading all the hosts to AlmaLinux 8 (some were on CentOS 7), and all to Docker 26.1.2, the latest version of portainer:sts seems to work perfectly now.

ghnp5 avatar May 21 '24 21:05 ghnp5

@ghnp5 I can confirm it, it works with 2.20.3 perfectly.

Feriman22 avatar May 22 '24 06:05 Feriman22

Same here, Docker updated to 26.1.3, Portainer EE and agents to 2.20.3, everything back to normal.

v1rusnl avatar May 30 '24 12:05 v1rusnl