uncloud connection refused from IPv6 dial upon deploying service

This occurred after initialising a new cluster with a custom IPv4 subnet range, but I don't think it's related:

$ uncloud service run --mode replicated --replicas 2 nginxdemos/hello
WARNING: failed to list containers on machine 'fdcc:ebf4:7c00:4c57:4b12:1dc3:136:3179': rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp [fdcc:ebf4:7c00:4c57:4b12:1dc3:136:3179]:51000: connect: connection refused"
[+] Running service hello-8pgu (replicated mode) 26/4
 ✔ Container hello-8pgu-9v7n on machine-ezhf  Started                                                                                                                                       7.7s
 ✔ Container hello-8pgu-c2jl on machine-1a73  Started                                                                                                                                       6.6s



hello-8pgu endpoints

The IP corresponds to an IPv6 IP on one of the nodes' uncloud interface:

$ ip a show uncloud
5: uncloud: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default
    link/none
    inet 100.64.1.1/24 brd 100.64.1.255 scope global uncloud
       valid_lft forever preferred_lft forever
    inet6 fdcc:ebf4:7c00:4c57:4b12:1dc3:136:3179/128 scope global
       valid_lft forever preferred_lft forever

Mar 16 '25 06:03 cedws

Uncloud daemon logs:

Mar 16 06:49:59 uncloud-2 systemd[1]: Starting uncloud.service - Uncloud machine daemon...
Mar 16 06:49:59 uncloud-2 uncloudd[3978]: {"level":"warn","ts":1742107799.7137914,"msg":"unable to determine directory for user configuration; falling back to current directory","error":"neither $XDG_CONFIG_HOME nor $HOME are defined"}
Mar 16 06:49:59 uncloud-2 uncloudd[3978]: INFO  Machine state file not found, creating a new one. path=/var/lib/uncloud/machine.json
Mar 16 06:49:59 uncloud-2 uncloudd[3978]: INFO  Generated machine key pair. pubkey=ebf47c004c574b121dc30136317944ff4d366fed3a695b7aeb8f8b15a72fb402
Mar 16 06:49:59 uncloud-2 uncloudd[3978]: INFO  Starting machine.
Mar 16 06:49:59 uncloud-2 uncloudd[3978]: INFO  Configured corrosion service. dir=/var/lib/uncloud/corrosion
Mar 16 06:49:59 uncloud-2 systemd[1]: uncloud.service: Got notification message from PID 3985, but reception only permitted for main PID 3978
Mar 16 06:49:59 uncloud-2 uncloudd[3978]: INFO  Corrosion systemd service started. unit=uncloud-corrosion.service
Mar 16 06:50:01 uncloud-2 uncloudd[3978]: INFO  Starting local machine API server. path=/run/uncloud/machine.sock
Mar 16 06:50:01 uncloud-2 uncloudd[3978]: INFO  Starting local API proxy server. path=/run/uncloud/uncloud.sock
Mar 16 06:50:01 uncloud-2 uncloudd[3978]: INFO  Waiting for the machine to be initialised as a member of a cluster to start the network controller.
Mar 16 06:50:01 uncloud-2 systemd[1]: Started uncloud.service - Uncloud machine daemon.
Mar 16 06:50:02 uncloud-2 uncloudd[3978]: INFO  Machine configured to join the cluster. id=041cda413b693e679be3c7d26bc0f6b9 name=machine-1a73
Mar 16 06:50:02 uncloud-2 uncloudd[3978]: INFO  Configured corrosion service. dir=/var/lib/uncloud/corrosion
Mar 16 06:50:02 uncloud-2 uncloudd[3978]: INFO  Starting network controller.
Mar 16 06:50:02 uncloud-2 uncloudd[3978]: INFO  Starting WireGuard network.
Mar 16 06:50:02 uncloud-2 uncloudd[3978]: INFO  Created WireGuard interface. name=uncloud
Mar 16 06:50:02 uncloud-2 uncloudd[3978]: INFO  Configured WireGuard interface. name=uncloud
Mar 16 06:50:02 uncloud-2 uncloudd[3978]: INFO  Updated addresses of the WireGuard interface. name=uncloud addrs="[fdcc:ebf4:7c00:4c57:4b12:1dc3:136:3179/128 100.64.1.1/24]"
Mar 16 06:50:02 uncloud-2 uncloudd[3978]: INFO  Brought WireGuard interface up. name=uncloud
Mar 16 06:50:02 uncloud-2 uncloudd[3978]: DEBUG Added route to peer(s) via WireGuard interface. name=uncloud dst=100.64.0.0/24
Mar 16 06:50:02 uncloud-2 uncloudd[3978]: DEBUG Added route to peer(s) via WireGuard interface. name=uncloud dst=fdcc:c83c:9121:c336:692e:4c19:acd5:b637/128
Mar 16 06:50:02 uncloud-2 uncloudd[3978]: DEBUG Removed route to peer(s) via WireGuard interface. name=uncloud dst=100.64.1.0/24
Mar 16 06:50:02 uncloud-2 uncloudd[3978]: DEBUG Removed route to peer(s) via WireGuard interface. name=uncloud dst=fdcc:ebf4:7c00:4c57:4b12:1dc3:136:3179/128
Mar 16 06:50:02 uncloud-2 uncloudd[3978]: INFO  Updated routes to peers via the WireGuard interface. name=uncloud peers=1
Mar 16 06:50:02 uncloud-2 uncloudd[3978]: INFO  WireGuard network configured.
Mar 16 06:50:02 uncloud-2 uncloudd[3978]: INFO  Restarting corrosion service to apply new configuration with WireGuard network.
Mar 16 06:50:07 uncloud-2 uncloudd[3978]: INFO  Corrosion systemd service restarted. unit=uncloud-corrosion.service
Mar 16 06:50:07 uncloud-2 systemd[1]: uncloud.service: Got notification message from PID 4056, but reception only permitted for main PID 3978
Mar 16 06:50:09 uncloud-2 uncloudd[3978]: INFO  Starting network API server. addr=[fdcc:ebf4:7c00:4c57:4b12:1dc3:136:3179]:51000
Mar 16 06:50:09 uncloud-2 uncloudd[3978]: INFO  Starting Caddyfile controller.
Mar 16 06:50:10 uncloud-2 uncloudd[3978]: INFO  Subscribed to machine changes in the cluster to reconfigure network peers.
Mar 16 06:50:10 uncloud-2 uncloudd[3978]: INFO  Reconfiguring network peers with the current machines. machines=2
Mar 16 06:50:10 uncloud-2 uncloudd[3978]: INFO  Configured WireGuard interface. name=uncloud
Mar 16 06:50:10 uncloud-2 uncloudd[3978]: INFO  Updated addresses of the WireGuard interface. name=uncloud addrs="[fdcc:ebf4:7c00:4c57:4b12:1dc3:136:3179/128 100.64.1.1/24]"
Mar 16 06:50:10 uncloud-2 uncloudd[3978]: INFO  Brought WireGuard interface up. name=uncloud
Mar 16 06:50:10 uncloud-2 uncloudd[3978]: INFO  Updated routes to peers via the WireGuard interface. name=uncloud peers=1
Mar 16 06:50:10 uncloud-2 uncloudd[3978]: INFO  Subscribed to container changes in the cluster to generate Caddy configuration.
Mar 16 06:50:10 uncloud-2 uncloudd[3978]: INFO  Peer status changed. public_key=c83c9121c336692e4c19acd5b6376f2ecb334fc388c7b0012506073723bf0a73 status=up previous_status=unknown
Mar 16 06:50:11 uncloud-2 uncloudd[3978]: INFO  Docker network created. name=uncloud subnet=100.64.1.0/24
Mar 16 06:50:11 uncloud-2 uncloudd[3978]: INFO  Docker network configured.
Mar 16 06:50:11 uncloud-2 uncloudd[3978]: INFO  Watching Docker containers and syncing them to cluster store.
Mar 16 06:50:11 uncloud-2 uncloudd[3978]: DEBUG Syncing containers to cluster store before processing Docker events.
Mar 16 06:50:17 uncloud-2 uncloudd[3978]: DEBUG Cluster containers changed, updating Caddy configuration.
Mar 16 06:50:17 uncloud-2 uncloudd[3978]: DEBUG Updated Caddy configuration. path=/var/lib/uncloud/caddy/caddy.json
Mar 16 06:50:23 uncloud-2 uncloudd[3978]: DEBUG Syncing containers to cluster store triggered by a Docker container event. container_id=ff35327e2bfef93df142414e83febc973b52cd725027a042578e3313895ee498 container_name=hello-8pgu-c2jl action=create
Mar 16 06:50:24 uncloud-2 uncloudd[3978]: DEBUG Syncing containers to cluster store triggered by a Docker container event. container_id=ff35327e2bfef93df142414e83febc973b52cd725027a042578e3313895ee498 container_name=hello-8pgu-c2jl action=start
Mar 16 06:50:24 uncloud-2 uncloudd[3978]: DEBUG Container record updated in store DB. id=ff35327e2bfef93df142414e83febc973b52cd725027a042578e3313895ee498 machine_id=041cda413b693e679be3c7d26bc0f6b9
Mar 16 06:50:24 uncloud-2 uncloudd[3978]: DEBUG Cluster containers changed, updating Caddy configuration.
Mar 16 06:50:24 uncloud-2 uncloudd[3978]: DEBUG Updated Caddy configuration. path=/var/lib/uncloud/caddy/caddy.json
Mar 16 06:50:41 uncloud-2 uncloudd[3978]: DEBUG Syncing containers to cluster store triggered by a regular interval. interval=30s

Mar 16 '25 07:03 cedws

Thank you for reporting this with debug logs! Did you run the run command shortly after initialising the cluster and adding machines? Am I right that you have 2 machines in the cluster and regardless of the warning the service was able to run 2 containers on each of them successfully?

Mar 16 '25 10:03 psviderski

Yep that's right, two node cluster. I believe both containers were started successfully.

I just tried reproducing this in the same way and didn't get it. Will leave any notes if it happens again.

Mar 18 '25 12:03 cedws

This is what likely happened: The service run command lists all containers on all machines before creating a new service container to check that the new service name is unique. As I mentioned earlier, the cluster store is not the source of truth for this data, the containers on machines labeled with uncloud.service.name are.

When the second machine uncloud-2 was added to the cluster, it started configuring its subsystems asynchronously. Which means machine add command finished successfully but didn't wait for the machine to become fully initialised and ready for service. According to the logs, it took 7 seconds for it to start the API service after being added to the cluster:

Mar 16 06:50:02 uncloud-2 uncloudd[3978]: INFO  Machine configured to join the cluster. id=041cda413b693e679be3c7d26bc0f6b9 name=machine-1a73
...
Mar 16 06:50:09 uncloud-2 uncloudd[3978]: INFO  Starting network API server. addr=[fdcc:ebf4:7c00:4c57:4b12:1dc3:136:3179]:51000

Your service run command was likely executed somewhere between 06:50:02 and 06:50:09 when uncloud-2 didn't run the API server on the uncloud WireGuard interface. However, I suspect it took a few seconds for the container to pull image and start on the first machine so by the time it tried to run a container on the second machine the API server was already running.

I also suspect that you initialised machines with --no-caddy and --no-dns which skip the readiness check. I can think of two options how to address this:

Always wait for the machine to become ready for init and add commands. It's easy to implement. However, there will still be cases when a machine is rebooted or the daemon is restarted which will cause the same failure.
Implement a robust system for detecting when a machine is down or down temporarily. For down temporarily we can implement retries with a reasonable timeout.

The option 1. can be implemented as a quick fix for this particular issue but 2. will need to be implemented longer term anyway as this affects any operations in the cluster

Mar 18 '25 23:03 psviderski