observation: docker system dial-stdio processes do not die

Open wdiechmann opened this issue 1 year ago • 8 comments

Not sure what is going on, but I'm observing a slowly "degenerating system" as I keep deploying. If this is just me being high as a kite on ethanol (without knowing it), I apologise for wasting your bandwidth 🙏

Symptoms

Deploys either fail or take forever, and service response is measurably below par.

Diagnostics

root@ubuntu-4gb-hel1-mortimer-1:~# ps ax
...8<...
3479828 ?        Ssl    0:00 docker system dial-stdio
3479855 ?        Ssl    0:00 docker system dial-stdio
3479861 ?        Ss     0:00 sshd: root@notty
3479928 ?        Ssl    0:00 docker system dial-stdio
3479946 ?        Ssl    0:00 buildctl dial-stdio
3480065 ?        Ss     0:00 sshd: root@pts/0
3480118 ?        I      0:00 [kworker/1:1-events]
3480138 pts/0    Ss     0:00 -bash
3480958 ?        I      0:00 [kworker/u4:3-flush-8:0]
3481521 pts/0    R+     0:00 ps ax
root@ubuntu-4gb-hel1-mortimer-1:~# ps ax | grep dial-stdio | wc -l
99
root@ubuntu-4gb-hel1-mortimer-1:~# shutdown -r now
...8<...
root@ubuntu-4gb-hel1-mortimer-1:~# ps ax | grep dial-stdio | wc -l
1
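A side note on the counts above: `ps ax | grep dial-stdio | wc -l` also counts the grep process itself, which is most likely the single hit reported right after the reboot. A minimal sketch of a count that leaves grep out (the function name `count_dial_stdio` is made up for illustration):

```shell
# Hypothetical helper: count dial-stdio processes in `ps ax`-style output on stdin.
# The [d] bracket makes the pattern match "dial-stdio" but not grep's own
# command line, which contains the literal text "[d]ial-stdio".
count_dial_stdio() {
  grep -c '[d]ial-stdio' || true   # grep exits 1 on zero matches; the count is still printed
}

# Live usage on the host:
#   ps ax | count_dial_stdio
```

Reading from stdin keeps the helper easy to try on saved `ps` output before pointing it at a live host.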

Remediation

I'm barking up the "Kamal communicates via the npipe, helped by docker system dial-stdio" tree: I suspect the "remote" process does not know when to exit and so hangs around indefinitely. Just a (wild) guess 🤷🏻‍♂️

Somehow signaling the process to 'go die' would perhaps solve the matter: in a perfect world only once the deploy has finished (with exit 0 or any other exit code), but otherwise after each command.

Reproduction

All I do is kamal env push && kamal deploy, once or twice per 2-hour slot, which effectively demands a reboot every other day.

#/config/deploy.yml
    ....8<...

builder:
  remote:
    arch: arm64
    host: ssh://[email protected]

# Deploy to these servers.
servers:
  web:
    hosts:
      - 1.2.3.4
    options:
    ....8<...

ssh:
  user: bob_the_builder

System

it's a rental, what can I say 😉

happy user of Hetzner services

root@ubuntu-4gb-hel1-mortimer-1:~# uname -a
Linux ubuntu-4gb-hel1-mortimer-1 5.15.0-112-generic #122-Ubuntu SMP Thu May 23 07:51:32 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux

and the ruby/rails env is

rails@e8d5d7728a6a:/rails$ bin/rails -v
Rails 8.0.0.alpha
rails@e8d5d7728a6a:/rails$ ruby -v
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [aarch64-linux]

and finally Kamal is

√ bellis % kamal version
1.3.1

wdiechmann avatar Jun 12 '24 11:06 wdiechmann

What are you running on ubuntu-4gb-hel1-mortimer-1? Is it used as the remote builder?

djmb avatar Jun 13 '24 13:06 djmb

It is - and a staging server (following the current litany out of Chicago = solid_queue, Kamal, SQLite and “1 container to rule them all”) 😉 - btw: I’m in tears regarding the work put into this making so much developer happiness 🥰 So: a huge thank you to all contributors!!

Cheers, Walther

wdiechmann avatar Jun 13 '24 15:06 wdiechmann

I think the docker system dial-stdio processes are related to the connections to your remote builder then. We have seen similar problems with ours. Looks maybe a bit like this - https://forums.docker.com/t/docker-continuously-making-unnecessary-ssh-connections-to-remote-servers/136132?

For now I'd suggest moving the remote builder to its own server to avoid affecting your app.

djmb avatar Jun 13 '24 16:06 djmb

yup - that’s the ’signature’

Good advice on the “separation of concerns” 😅

Cheers, Walther

wdiechmann avatar Jun 13 '24 17:06 wdiechmann

Disclaimer: I've not yet enjoyed Kamal 2.0, and I do build on the (test) host, which may add significantly to the number of open connections.

If anyone ends up here, either because the host dies under the weight of dial-stdio (well, it is less dramatic than that: it runs out of memory and ends up in a kind of "coma", a Kamal-infused coma you might say), or because you realize that the client (your machine - mac/linux/pc) keeps every single ssh connection used by Kamal open 'til hell freezes over or you show them the kitchen door 😉

These are the cleanups:

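# your client: hang up the leftover ssh processes (the [C] keeps grep from matching itself)
kill -HUP $(ps aux | grep '[C]onnectTimeout' | awk '{ print $2 }')

# host: kill the lingering dial-stdio helpers (the Ssl state filter also drops the grep itself)
kill $(ps ax | grep 'docker system dial-stdio' | grep Ssl | awk '{print $1}')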
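The host-side cleanup can also be split into "find the PIDs" and "kill them", which makes it possible to inspect the list first. A minimal sketch, assuming `ps ax` output with the PID in the first column; the function name `dial_stdio_pids` is made up for illustration:

```shell
# Hypothetical helper: print the PIDs (first column of `ps ax` output on stdin)
# of lingering "docker system dial-stdio" processes. The [d] bracket keeps the
# pattern from matching the awk command line (or an ssh-invoked shell) that contains it.
dial_stdio_pids() {
  awk '/[d]ocker system dial-stdio/ { print $1 }'
}

# Live usage on the host (this also kills helpers of a deploy still in progress):
#   ps ax | dial_stdio_pids | while read -r pid; do kill "$pid"; done
```

Running `ps ax | dial_stdio_pids` on its own first shows what would be killed.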

wdiechmann avatar Oct 24 '24 09:10 wdiechmann

I am having similar problems, although I think there are two separate problems here.

I use kamal 2.1.1 to deploy from my MBP to three low cost Ubuntu servers. The one chosen as the remote builder runs out of memory in something like 18-24 hours and has to be restarted, after which the process repeats. If I manage to get onto the server before it runs out of memory there are around 100 docker system dial-stdio processes. They appear to be created at a rate of four per hour. This is problem 1.

Problem 2 is that there are also lots of processes on the laptop from which I deploy, 56 as I write, of the form:

ssh -o ConnectTimeout=30 -l [deploy user] -- [IP of server 1, 2 or 3] docker system dial-stdio

All are apparently left over from past deployments and are easy to clean up. They are not connected to the processes on the remote build server: if I kill the laptop processes, the remote build processes are still created. If I stop the buildkit container, the processes are still spawned. Currently I don't know what causes this continual spawning, and I just have to kill them manually if I want the server to remain up.

rogermarlow avatar Oct 26 '24 11:10 rogermarlow

Update: the problem is docker buildx running locally. It connects to the remote host every few minutes (I'm not sure why, since I don't want a build), and it leaves a docker system dial-stdio process on the remote build host. And it's not just buildx running on my laptop; it is buildx running on the laptops of all the developers working on this project (docker buildx stop .... does not stop the builder). I have resorted to cron jobs that kill the processes every 30 minutes on the client and server.
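For the server side of such a cron job, one option is to kill only helpers that have been around for a while, so a deploy in progress is left alone. A minimal sketch under stated assumptions: a Linux host with procps-ng `ps` (for the `etimes` column), a 30-minute threshold, and a made-up function name and script path:

```shell
# Hypothetical helper: print PIDs of dial-stdio helpers at least MIN_AGE seconds old.
# Input on stdin: "PID ELAPSED_SECONDS COMMAND..." lines, as printed by
# `ps -e -o pid= -o etimes= -o args=` (etimes is a procps-ng keyword, so Linux only).
stale_dial_stdio_pids() {
  min_age="${1:-1800}"   # default threshold: 30 minutes
  awk -v min="$min_age" \
    '($2 + 0) >= (min + 0) && /[d]ocker system dial-stdio/ { print $1 }'
}

# Live usage on the build host, for example inside a script run by cron:
#   ps -e -o pid= -o etimes= -o args= | stale_dial_stdio_pids 1800 |
#     while read -r pid; do kill "$pid"; done
# Assumed crontab entry (the script path is made up):
#   */30 * * * * /usr/local/bin/reap_dial_stdio
```

The age filter is a design choice, not something from the thread: it trades a slower cleanup for not cutting off connections that a running build may still need.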

This was raised in March in a Docker community forum post.

rogermarlow avatar Oct 26 '24 21:10 rogermarlow

Update 2: as we don't strictly need to use a remote builder, we dropped the remote option and build locally instead. We also had to go into Docker Desktop for every developer that had deployed and remove the remote builders (Settings -> Builders). Once we cleaned up the dial-stdio processes on the remote build machine we have rock-steady memory usage.

rogermarlow avatar Oct 29 '24 10:10 rogermarlow

@rogermarlow I fiddled on with the script(s) and now it looks like this (I have one named prod addressing deployment to production, too) and with this I can have my cake and eat it too 😄 (building remotely without risking exhausting the host)

# bin/stage
ssh docker5 ls
kamal env push --destination=staging
kamal deploy --destination=staging
echo Cleaning SSH local: `ps aux | grep 'ConnectTimeout' | wc -l` remote: `ssh docker5 -lroot "ps aux | grep 'ConnectTimeout' | wc -l"`
clean_ssh > /dev/null 2>&1
echo Cleaned SSH local: `ps aux | grep 'ConnectTimeout' | wc -l` remote: `ssh docker5 -lroot "ps aux | grep 'ConnectTimeout' | wc -l"`

# bin/clean_ssh
kill -HUP `ps aux | grep 'ConnectTimeout' | awk '{ print $2}'` > /dev/null 2>&1
ssh docker5 -lroot "kill \$(ps ax | grep 'docker system dial-stdio' | awk '{print \$1}')" > /dev/null 2>&1

notes:
Line 2 in the stage script "wakes up" the VM on Hetzner -- it's not necessary if you're not 'on the cheap' 😆
Lines 5 and 7 are only for reporting - not necessary
Line 2 in the clean_ssh script cleans local processes
Line 3 does the same on the host

wdiechmann avatar Nov 01 '24 08:11 wdiechmann

Perhaps Kamal could stop the builder when its work is done:

docker buildx stop kamal-remote-ssh--username-hostname

You can try that out in your app with a post-deploy hook. In .kamal/hooks/post-deploy:

#!/bin/bash
docker buildx stop kamal-remote-ssh--yourbuilderusername-yourbuilderhostname

(Note: assumes Kamal 2 builder naming convention. Adjust for older Kamal 1 builder names like kamal-$service-native-remote)

jeremy avatar Nov 02 '24 21:11 jeremy

Stopping the remote builder didn't help. Removing the builder locally from Docker Desktop solves the issue, or just quitting Docker Desktop.
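For anyone who prefers the command line over the Docker Desktop settings page, `docker buildx rm` removes a builder by name. Whether that has exactly the same effect as removing it in Docker Desktop is an assumption here, as is the layout of the `docker buildx ls` table that this sketch parses; the function name `kamal_builders` is made up:

```shell
# Hypothetical helper: list builder names starting with "kamal-" from
# `docker buildx ls` table output on stdin. Assumes builder rows start in
# column 0 and node rows are indented; a trailing "*" marks the current builder.
kamal_builders() {
  awk 'NR > 1 && /^[^ \t]/ {
    name = $1
    sub(/\*$/, "", name)              # drop the current-builder marker
    if (name ~ /^kamal-/) print name
  }'
}

# Live usage on each developer machine:
#   docker buildx ls | kamal_builders | while read -r b; do docker buildx rm "$b"; done
```

Running `docker buildx ls | kamal_builders` alone first shows which builders would be removed.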

texpert avatar Jan 03 '25 23:01 texpert

Not sure this adds any progress towards solving this issue, but I noticed just the other day that without my local Docker Desktop (actually I'm on OrbStack, but it's a 1:1 replacement - just faster) busily chugging along, my deploys/builds didn't happen at all -- so to me it looks like the remote building thing is not happening (but I'm probably wrong)

it does, however, support your observations as to the local container env, @texpert

wdiechmann avatar Jan 06 '25 07:01 wdiechmann

This is a docker bug, so I'm going to close this.

djmb avatar Apr 23 '25 14:04 djmb

I know - it's not cool to comment "after closing hours" but I went through my logs and incidentally all my issues vanished when I swapped Docker Desktop for OrbStack 😲

wdiechmann avatar Apr 23 '25 15:04 wdiechmann