2025-03-03 ZeroTier - health checking - alternative proposal
This PR follows on from the extensive discussion associated with #37.
Never before have I even contemplated submitting a PR covering the same ground as an existing open PR. However, on this occasion I thought it might be useful to have a concrete proposal to compare and contrast with #37.
I sincerely hope that laying this on the (virtual) table and then minimising further interaction might help us converge on a solution.
- docker-compose.yml and docker-compose-router.yml:
  - replaces deprecated "version" statement with "---".
  - adds example environment variables.
- Dockerfile:
  - corrects case of "as" to "AS" (silences build warning).
  - adds and configures healthcheck.sh (as per #37).
  - includes tzdata package (moved from Dockerfile.router) so messages have local timestamps.
- Dockerfile.router:
  - removes tzdata (moved to Dockerfile).
  - removes
- entrypoint-router.sh:
  - code for first-launch auto-join of listed networks expanded to include additional help material.
- entrypoint.sh:
  - "first launch" auto-join of listed networks (code copied from entrypoint-router.sh, as modified per above).
  - "self repair" of permissions in persistent store (code copied from entrypoint-router.sh).
  - adds launch-time message to make it clear that the client is launching (complements messages in entrypoint-router.sh).
  - abstracts some common strings to environment variables (opportunistic change).
- README.md:
  - updates examples.
  - describes new environment variables (including move of ZEROTIER_ONE_NETWORK_IDS from README-router.md).
  - documents health-checking.
- README-router.md:
  - updates examples.
  - explains relationship of router and client.
Added:
- healthcheck.sh, based on original proposal in #37 and subsequent suggestions for modification by me.
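The "first launch" auto-join listed above for entrypoint.sh and entrypoint-router.sh only acts on a clean slate. A minimal sketch of that idea is shown here. It is not the code in this PR, and it rests on two assumptions: that ZEROTIER_ONE_NETWORK_IDS holds a space-separated list of network IDs, and that zerotier-one joins every network that has a «networkID».conf file in networks.d when it starts. The function name first_launch_auto_join is hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch of a "first launch" auto-join; not the PR's code.
# Assumptions: ZEROTIER_ONE_NETWORK_IDS is a space-separated list of
# network IDs, and zerotier-one joins every network that has a
# <networkID>.conf file in networks.d when it starts.

# Usage: first_launch_auto_join <path to networks.d> "<id> [<id> ...]"
first_launch_auto_join() {
  local networks_dir="$1" ids="$2" id
  # Clean slate only: if networks.d already exists, the user has already
  # joined (or deliberately left) networks, so change nothing.
  if [ -d "$networks_dir" ] || [ -z "$ids" ]; then
    return 0
  fi
  mkdir -p "$networks_dir" || return 1
  for id in $ids; do    # deliberately unquoted: split the list on whitespace
    echo "First launch: arranging auto-join of network $id"
    touch "$networks_dir/$id.conf" || return 1
  done
}

first_launch_auto_join /var/lib/zerotier-one/networks.d "${ZEROTIER_ONE_NETWORK_IDS:-}"
```

Keying on the absence of networks.d is what keeps this safe: once the directory exists, a restart never re-creates a network the user has since left.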
I gave serious consideration to the code for synchronising networks in the entrypoint scripts. The idea is quite attractive. It is safe to automate joins in a "clean slate" situation. However, a leave followed by a join is not guaranteed to restore the previous state, because the leave destroys the network-specific configuration options (allowManaged, allowGlobal, allowDefault, allowDNS).
On balance I think it's better left to users to send explicit leave commands via the CLI and take responsibility for restoring lost configuration options on any subsequent join.
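For a user who does send an explicit leave and later rejoins, restoring the lost options by hand could look like the sketch below. It is a hypothetical helper, not part of this PR: the names zt and rejoin_with_options are made up, it assumes the container is named zerotier (as in the reference service definition used for testing), and it relies only on the zerotier-cli leave, join and set commands. The caller has to list the settings to re-apply, because the leave has already discarded them.

```shell
#!/bin/sh
# Hypothetical helper, not part of the PR: leave and rejoin a network,
# then re-apply per-network options that the leave discarded.

# Assumption: the container is named "zerotier".
zt() {
  docker exec zerotier zerotier-cli "$@"
}

# Usage:   rejoin_with_options <networkID> [setting=value ...]
# Example: rejoin_with_options 9999888877776666 allowDNS=1 allowGlobal=0
rejoin_with_options() {
  local nwid="$1" setting
  shift
  zt leave "$nwid" || return 1
  zt join "$nwid" || return 1
  # The settings must come from the caller: after the leave, nothing is
  # left in the container to restore them from.
  for setting in "$@"; do
    zt set "$nwid" "$setting" || return 1
  done
}
```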
I will post the results of testing this PR separately.
Additional changes as at 2025-04-06
healthcheck.sh:
- Full rewrite, including copious comments explaining theory of operation. In essence, if a «networkID» is mentioned in (internal path) /var/lib/zerotier-one/networks.d then it should be matched by a route in the host's routing table:
  - Zero «networkID» = zero routes
  - One «networkID» = one route
  - Two «networkID» = two routes
  - ...
  Any mismatch causes the container to go unhealthy. From the perspective of the health-checking script, the question of which network is at fault is immaterial, so there is no need to employ techniques such as iterating to discover which network is causing a problem. This is because the script has no way of communicating anything other than an exit status. Any echo statements it issues will not make it into the container's log.
  The simple presence/absence of a «networkID» in networks.d is taken to indicate which networks the user intends the container to join.
  There is no reliance on environment variables to propagate any health-checking information into the container. No other variables are introduced, so the argument about naming conventions goes away.
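To make the counting rule concrete, here is a minimal sketch of it. It is not the PR's healthcheck.sh. It assumes that each joined network is represented by a «networkID».conf file (16 characters plus ".conf") in networks.d, and it counts routes with the same "dev zt.* scope link" pattern used with ip r in the tests below. Because the container runs with network_mode: host, ip route inside it sees the host's routing table. The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch of the "one networkID = one route" rule.
# It is NOT the PR's healthcheck.sh; names and details are illustrative.

NETWORKS_DIR="${NETWORKS_DIR:-/var/lib/zerotier-one/networks.d}"

# Count files named <16 characters>.conf, e.g. 9999888877776666.conf.
# Assumption: one such file exists per joined network. <networkID>.local.conf
# files are skipped, as is the unexpanded "*.conf" left behind when the
# directory is missing or empty.
count_networks() {
  local n=0 f
  for f in "$1"/*.conf; do
    case "${f##*/}" in
      *.local.conf) ;;
      ????????????????.conf) n=$((n + 1)) ;;
    esac
  done
  echo "$n"
}

# Count link-scope routes on zt* interfaces in "ip route" output on stdin.
count_zt_routes() {
  grep -c "dev zt.* scope link" || true   # grep -c still prints 0 on no match
}

# Healthy (status 0) only when both counts agree; otherwise status 1.
is_healthy() {
  local networks routes
  networks="$(count_networks "$NETWORKS_DIR")"
  routes="$(ip route 2>/dev/null | count_zt_routes)"
  [ "$networks" -eq "$routes" ]
}

# The script's exit status is the status of this last command.
is_healthy
```

Ending with is_healthy makes its status the script's exit status, which is the only thing Docker's HEALTHCHECK evaluates.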
Dockerfile:
- Removes --start-interval flag from HEALTHCHECK command. The flag was preventing buildah builds from succeeding.
README.md:
- Removes references to the following environment variables:
  - ZEROTIER_ONE_CHK_SPECIFIC_NETWORKS
  - ZEROTIER_ONE_CHK_MIN_ROUTES_FOR_HEALTH
- Rewrites explanation of health-checking.
README-router.md:
- Removes references to the following environment variables:
  - ZEROTIER_ONE_CHK_SPECIFIC_NETWORKS
  - ZEROTIER_ONE_CHK_MIN_ROUTES_FOR_HEALTH
@Paraphraser @zyclonite
My humble question, for a small use-case scenario: what if the user wants to check ALL networks he has joined (which can change dynamically)? How do we check that in this proposal?
First I was on vacation, and last week I was drowning a bit in work; sorry for the delay.
Regarding the buildah version: I fear we have to live with the one provided with the Ubuntu 24 GitHub Actions runner (they might upgrade it from time to time). I did attempt to upgrade it individually in the past, but there are quite a few security hurdles around that, so it's not easy to achieve (this was my old project, but it no longer works with the latest runners: https://github.com/zyclonite/setup-podman).
About the "sponsored by" comment: I fully support individual contributions, so you are free to add your real name, but I would rather not go down the route of having companies sponsor code, as this might be tricky from a licensing perspective if it is not in sync with the individual contributor, and so on...
@zyclonite After due discussions with PMGA Tech LLP, I have changed the license text to match the wording used by most major MIT-licensed projects, e.g. Tailwind CSS (https://github.com/tailwindlabs/tailwindcss/blob/main/LICENSE) and VS Code by Microsoft (https://github.com/microsoft/vscode/blob/main/LICENSE.txt). This should alleviate your fears, as the text now matches that of most major MIT projects.
However, as per my agreement with PMGA Tech LLP, the code cannot be used without attaching the proper licensing text. It is now not possible for me to go back on it.
Further, I still do not understand the fuss behind all this, as many major MIT projects owned by corporations also include license files (examples already given above), and ZeroTier itself is not MIT-licensed but BSL-licensed; you can check here: https://github.com/zerotier/ZeroTierOne?tab=License-1-ov-file
As I do not want to take this argument further, it's your call whether to merge the code or cancel it.
Please note one more thing: if you intend to merge this proposal, the proper licensing text needs to be copied from PR #37. I'm afraid that if the text is not attached as-is, it would be a deliberate copyright infringement, since it would have been knowingly removed and/or not attached.
I would suggest you merge PR #37, since this is just a copy of that PR with minor changes, as already admitted by the OP in the first post. (Attaching screenshot for future reference in case required.)
Testing (with 2025-04-06 changes)
Reference service definition
zerotier:
  container_name: zerotier
  image: "zyclonite/zerotier:local"
  restart: unless-stopped
  network_mode: host
  environment:
    - TZ=${TZ:-Etc/UTC}
    # - ZEROTIER_ONE_NETWORK_IDS=${ZEROTIER_ONE_NETWORK_IDS}
  volumes:
    - ./volumes/zerotier-one:/var/lib/zerotier-one
  devices:
    - "/dev/net/tun:/dev/net/tun"
  cap_add:
    - NET_ADMIN
    - SYS_ADMIN
Note:
- ZEROTIER_ONE_NETWORK_IDS commented out to disable auto-join in a clean-slate situation.
Test 1 - clean slate
- Show container not running and no persistent store:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
$ ls -ld ~/IOTstack/volumes/zerotier-one
ls: cannot access '/home/moi/IOTstack/volumes/zerotier-one': No such file or directory
- Start container:
$ docker compose up -d zerotier
Show persistent store created:
$ ls -ld ~/IOTstack/volumes/zerotier-one
drwxr-xr-x 4 999 994 4096 Apr 6 09:46 /home/moi/IOTstack/volumes/zerotier-one
Show networks directory does not exist:
$ ls -ld ~/IOTstack/volumes/zerotier-one/networks.d
ls: cannot access '/home/moi/IOTstack/volumes/zerotier-one/networks.d': No such file or directory
Show container healthy:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0e5b59715ff3 zyclonite/zerotier:local "entrypoint.sh -U" 3 minutes ago Up 3 minutes (healthy) zerotier
Test 2 - Join first network
- Join network:
$ docker exec zerotier zerotier-cli join 9999888877776666
200 join OK
Show container goes unhealthy:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0e5b59715ff3 zyclonite/zerotier:local "entrypoint.sh -U" 6 minutes ago Up 6 minutes (unhealthy) zerotier
Explore reason:
$ docker exec zerotier zerotier-cli listnetworks
200 listnetworks <nwid> <name> <mac> <status> <type> <dev> <ZT assigned ips>
200 listnetworks 9999888877776666 22:ef:5f:10:91:a9 ACCESS_DENIED PRIVATE ztr2qsmswx -
$ docker exec zerotier zerotier-cli get 9999888877776666 status
ACCESS_DENIED
Authorise client in ZeroTier Central. Then show container goes healthy:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0e5b59715ff3 zyclonite/zerotier:local "entrypoint.sh -U" 10 minutes ago Up 10 minutes (healthy) zerotier
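Instead of re-running docker ps until the STATUS column changes, a small polling helper can wait for a transition such as the one after authorising the client. This is a hypothetical test aid, not part of the PR: the names health_of and wait_for_health are made up, and the 5-second poll and 300-second default timeout are arbitrary choices. It reads the status with docker inspect and the {{.State.Health.Status}} template.

```shell
#!/bin/sh
# Hypothetical test helper, not part of the PR: wait until a container's
# health status (as reported by "docker inspect") reaches a wanted value.

# Prints starting, healthy or unhealthy for a container with a HEALTHCHECK.
health_of() {
  docker inspect --format '{{.State.Health.Status}}' "$1"
}

# Usage: wait_for_health <container> <healthy|unhealthy> [timeout seconds]
# Checks every 5 seconds; returns 0 on a match, or 1 once the timeout
# (default 300 seconds) has been used up.
wait_for_health() {
  local container="$1" want="$2" timeout="${3:-300}" waited=0
  while true; do
    if [ "$(health_of "$container" 2>/dev/null)" = "$want" ]; then
      return 0
    fi
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 5
    waited=$((waited + 5))
  done
}
```

For example, wait_for_health zerotier healthy would return once docker ps would show (healthy). How long a transition takes depends on the HEALTHCHECK interval and retry settings in the Dockerfile.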
Test 3 - interrupt first network
- List networks:
$ ip r | grep "dev zt.* scope link"
10.244.0.0/16 dev ztr2qsmswx proto kernel scope link src 10.244.235.233
- Destroy network:
$ sudo nmcli conn down ztr2qsmswx
Connection 'ztr2qsmswx' successfully deactivated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/144)
Show route removed:
$ ip r | grep "dev zt.* scope link"
$
Show container goes unhealthy:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0e5b59715ff3 zyclonite/zerotier:local "entrypoint.sh -U" 12 minutes ago Up 12 minutes (unhealthy) zerotier
Show agent unaware of problem:
$ docker exec zerotier zerotier-cli get 9999888877776666 status
OK
- Restart container:
$ docker compose restart zerotier
[+] Restarting 1/1
✔ Container zerotier Started 1.5s
Show route restored:
$ ip r | grep "dev zt.* scope link"
10.244.0.0/16 dev ztr2qsmswx proto kernel scope link src 10.244.235.233
Show container healthy:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0e5b59715ff3 zyclonite/zerotier:local "entrypoint.sh -U" 13 minutes ago Up 9 seconds (healthy) zerotier
Test 4 - join second network
- Join network:
$ docker exec zerotier zerotier-cli join 9999888877775555
200 join OK
Show container goes unhealthy:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0e5b59715ff3 zyclonite/zerotier:local "entrypoint.sh -U" 17 minutes ago Up 4 minutes (unhealthy) zerotier
Explore reason:
$ docker exec zerotier zerotier-cli get 9999888877775555 status
ACCESS_DENIED
Authorise client in ZeroTier Central. Then show container goes healthy:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0e5b59715ff3 zyclonite/zerotier:local "entrypoint.sh -U" 18 minutes ago Up 5 minutes (healthy) zerotier
Test 5 - interrupt one network
- List networks:
$ ip r | grep "dev zt.* scope link"
10.242.0.0/16 dev ztc3qzoglu proto kernel scope link src 10.242.235.233
10.244.0.0/16 dev ztr2qsmswx proto kernel scope link src 10.244.235.233
- Destroy one of the networks (would not matter which one):
$ sudo nmcli conn down ztr2qsmswx
Connection 'ztr2qsmswx' successfully deactivated (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/145)
Show route removed:
$ ip r | grep "dev zt.* scope link"
10.242.0.0/16 dev ztc3qzoglu proto kernel scope link src 10.242.235.233
$
Show container goes unhealthy:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0e5b59715ff3 zyclonite/zerotier:local "entrypoint.sh -U" 25 minutes ago Up 12 minutes (unhealthy) zerotier
Show agent unaware of problem:
$ docker exec zerotier zerotier-cli listnetworks
200 listnetworks <nwid> <name> <mac> <status> <type> <dev> <ZT assigned ips>
200 listnetworks 9999888877776666 My_ZeroTier 22:ef:5f:10:91:a9 OK PRIVATE ztr2qsmswx 10.244.235.233/16
200 listnetworks 9999888877775555 Test 5e:d2:83:cc:ff:c4 OK PRIVATE ztc3qzoglu 10.242.235.233/16
- Restart container:
$ docker compose restart zerotier
[+] Restarting 1/1
✔ Container zerotier Started 1.5s
Show route restored:
$ ip r | grep "dev zt.* scope link"
10.242.0.0/16 dev ztc3qzoglu proto kernel scope link src 10.242.235.233
10.244.0.0/16 dev ztr2qsmswx proto kernel scope link src 10.244.235.233
Show container healthy:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0e5b59715ff3 zyclonite/zerotier:local "entrypoint.sh -U" 27 minutes ago Up 23 seconds (healthy) zerotier
Test 6 - down, up the container
$ docker compose down zerotier
[+] Running 1/1
✔ Container zerotier Removed 2.4s
$ ip r | grep "dev zt.* scope link"
$ docker compose up -d zerotier
[+] Running 1/1
✔ Container zerotier Started 0.2s
$ ip r | grep "dev zt.* scope link"
10.242.0.0/16 dev ztc3qzoglu proto kernel scope link src 10.242.235.233
10.244.0.0/16 dev ztr2qsmswx proto kernel scope link src 10.244.235.233
$ docker exec zerotier zerotier-cli listnetworks
200 listnetworks <nwid> <name> <mac> <status> <type> <dev> <ZT assigned ips>
200 listnetworks 9999888877776666 My_ZeroTier 22:ef:5f:10:91:a9 OK PRIVATE ztr2qsmswx 10.244.235.233/16
200 listnetworks 9999888877775555 Test 5e:d2:83:cc:ff:c4 OK PRIVATE ztc3qzoglu 10.242.235.233/16
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
c2111263a821 zyclonite/zerotier:local "entrypoint.sh -U" 8 seconds ago Up 8 seconds (healthy) zerotier