helm-zabbix
upgrade issue
Describe the bug: Unable to upgrade from 6.0.13 to 6.4. I can't seem to find a way to comment out HANodeName so that the server can start up in standalone mode.
Version of Helm and Kubernetes: 1.27
Any suggestions?
Hi @hooray4me!
Sorry for the late reply.
Today I published versions 4.0.0 and 4.0.1 of the chart, which contain some important changes. I recommend that you read the release notes and test them.
The HA mode of the Zabbix Server can be disabled with the following values:
zabbixServer:
  enabled: true
  replicaCount: 1
HA mode only works with two or more Zabbix Server replicas.
Today I tried to upgrade Zabbix from 6.0.9 to 6.4.7, and even with
zabbixServer:
  enabled: true
  replicaCount: 1
it still starts in HA mode:
8:20231026:124626.896 current database version (mandatory/optional): 06000000/06000043
8:20231026:124626.896 required mandatory version: 06040000
8:20231026:124626.896 mandatory patches were found
8:20231026:124626.906 cannot perform database upgrade in HA mode: all nodes need to be stopped and Zabbix server started in standalone mode for the time of upgrade.
The Zabbix upgrade documentation says: "[...] change its configuration to standalone mode by commenting out HANodeName [parameter]." So I tried to add
- name: "ZBX_HANODENAME"
  value:
to zabbixServer.extraEnv:
but the deployment ignores it:
** Updating '/etc/zabbix/zabbix_server.conf' parameter "TLSCipherAll13": ''...removed
** Updating '/etc/zabbix/zabbix_server.conf' parameter "TLSCipherCert": ''...removed
** Updating '/etc/zabbix/zabbix_server.conf' parameter "TLSCipherCert13": ''...removed
** Updating '/etc/zabbix/zabbix_server.conf' parameter "TLSCipherPSK": ''...removed
** Updating '/etc/zabbix/zabbix_server.conf' parameter "TLSCipherPSK13": ''...removed
** Updating '/etc/zabbix/zabbix_server.conf' parameter "TLSKeyFile": 'privatekey'...added
** Updating '/etc/zabbix/zabbix_server.conf' parameter "TLSPSKIdentity": ''...removed
** Updating '/etc/zabbix/zabbix_server.conf' parameter "TLSPSKFile": ''...removed
** Updating '/etc/zabbix/zabbix_server.conf' parameter "ServiceManagerSyncFrequency": ''...removed
** Updating '/etc/zabbix/zabbix_server.conf' parameter "HANodeName": 'zabbix-services-zabbix-server-ddff74775-rhl4z'...added
** Updating '/etc/zabbix/zabbix_server.conf' parameter "NodeAddress": '10.66.34.204'...added
** Updating '/etc/zabbix/zabbix_server.conf' parameter "User": 'zabbix'...added
Changing Docker images gave no result, so I assume Helm somehow defines ZBX_HANODENAME=hostname.
P.S. Removing ZBX_HANODENAME or setting it to null didn't have any effect.
So in the end I was able to upgrade from 6.0.9 to 6.4.8, using Helm chart version 4.0.2 and image alpine-6.4-latest.
The problem was the undocumented parameter ZBX_AUTOHANODENAME, which is hardcoded into the chart, is always present on the pod, and is responsible for starting the server in HA mode.
Interestingly enough, I could set
- name: ZBX_AUTOHANODENAME
  value: ""
only without any other parameters in zabbixServer.extraEnv.
If any other parameter (in this case ZBX_HANODENAME) was present, it resulted in an error:
client.go:428: [debug] error updating the resource "zabbix-zabbix-server":
cannot patch "zabbix-zabbix-server" with kind Deployment: The order in patch list:
[map[name:ZBX_AUTOHANODENAME value:hostname] map[name:ZBX_AUTOHANODENAME value:] map[name:ZBX_HANODENAME value:]]
doesn't match $setElementOrder list:
[map[name:DB_SERVER_HOST] map[name:DB_SERVER_PORT] map[name:POSTGRES_USER] map[name:POSTGRES_PASSWORD] map[name:POSTGRES_DB] map[name:ZBX_AUTOHANODENAME] map[name:ZBX_HANODENAME] map[name:ZBX_AUTOHANODENAME] map[name:ZBX_NODEADDRESS] map[name:ZBX_WEBSERVICEURL] map[name:ZBX_STARTREPORTWRITERS]]
So I did two deployment cycles: one without any parameters except ZBX_AUTOHANODENAME, and, after the DB was upgraded, a second cycle with all the usual parameters but without ZBX_AUTOHANODENAME.
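For reference, a minimal sketch of what the values for the two cycles could look like, using the chart's zabbixServer.replicaCount and zabbixServer.extraEnv keys. This is illustrative only; your other extraEnv entries will differ, and the placeholder name below is not a real Zabbix variable.
# cycle 1: single replica, ZBX_AUTOHANODENAME blanked out so the DB schema upgrade can run
zabbixServer:
  enabled: true
  replicaCount: 1
  extraEnv:
    - name: ZBX_AUTOHANODENAME
      value: ""
---
# cycle 2: after the schema upgrade, the usual extraEnv entries again, without ZBX_AUTOHANODENAME
zabbixServer:
  enabled: true
  replicaCount: 1
  extraEnv:
    # restore whatever entries you normally use here, just without ZBX_AUTOHANODENAME
    - name: SOME_USUAL_PARAMETER   # placeholder for illustration, not a real Zabbix variable
      value: "..."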
It is by design that the Zabbix server ALWAYS starts in HA mode, even with replicas set to 1. This is to make sure that a scale-up just works, and it has, or at least didn't have when I developed that part, no negative effect except being an "HA cluster with just one node". The issue with upgrading to a new major version is not entirely solved yet. The best workaround, if I understand your post correctly, would be to scale down to just one replica, then do the upgrade, then scale up again. Or am I getting something completely wrong?
Sorry, I did not read carefully. So the problem is that apparently the Zabbix server recently stopped accepting a database upgrade while running in HA mode. This is actually new to me. Let me think about how to solve this in the most elegant way... First idea: we have a job that runs in single mode and prepares the database before the "real" Zabbix server pods start up, designed to prepare the database structure in case of a fresh installation. I am thinking of a similar solution for upgrading:
- in the sidecars of the Zabbix Server pods, which prevent those from starting when no database is there yet, add a check to figure out that a major release upgrade is necessary and prevent Zabbix servers from starting
- start a job that does this upgrade, using the Zabbix Server image but starting it one-shot, only to upgrade the database
- then let the Zabbix Server(s) start
Yep, exactly - the Zabbix server does not accept upgrading the database if run in HA mode. As for the solution, it sounds great if it could be implemented that way.
I am in the incubation phase of finding a solution :)
For me, setting ZBX_AUTOHANODENAME to "" (without specifying ZBX_HANODENAME in values.yaml in any way whatsoever) during the upgrade did the trick. The other extra env variables I didn't touch (I use TimescaleDB, so I can't do without them).
UPD: and setting replicaCount to 1 during the upgrade, of course.
An upgrade from 6 to 7 unfortunately fails (well, it doesn't actually fail, but it doesn't complete entirely) when using TimescaleDB, because timescaledb.sql must be executed once again to create the newly needed hypertable:
229:20240619:083705.201 [Z3005] query failed: [0] PGRES_FATAL_ERROR:ERROR: table "auditlog" is not a hypertable
I am wondering whether the best way to solve this once and for all is to create a post-install and post-upgrade hook job that handles all the database-schema-related tasks. Up to now we have one Job, simply deployed with the chart and only taking care of initializing the database. The good thing was that no custom image was needed for that, just a bit of sed magic. I think this has to be redesigned entirely, also for future use cases, having ONE custom image taking care of:
- creating empty database schema if none existing
- upgrading database schema in case a major release upgrade happened
- initializing / upgrading TimescaleDB stuff
It should be built as a custom image, or at least use an entrypoint script mounted as a configmap or such, but the image should be based on the Zabbix Server image (needed for the actual upgrade of the DB schema).
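As an illustration of the "entrypoint script mounted as a configmap" variant, a rough sketch follows. All names are hypothetical, the script body is a placeholder, and the image tag is an assumption, not the actual chart content:
apiVersion: v1
kind: ConfigMap
metadata:
  name: zabbix-db-maintenance-entrypoint    # hypothetical name
data:
  entrypoint.sh: |
    #!/bin/bash
    # placeholder: create the schema if none exists, upgrade it after a major
    # release upgrade, initialize/upgrade the TimescaleDB objects
    echo "database maintenance steps would run here"
---
# pod spec fragment of the job: based on the Zabbix server image, with the script mounted
containers:
  - name: db-maintenance
    image: zabbix/zabbix-server-pgsql:alpine-7.0-latest    # assumption: same image family the chart uses
    command: ["/bin/bash", "/scripts/entrypoint.sh"]
    volumeMounts:
      - name: entrypoint
        mountPath: /scripts
volumes:
  - name: entrypoint
    configMap:
      name: zabbix-db-maintenance-entrypoint
      defaultMode: 0755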
From my point of view, this should also fix the problem found above when the Zabbix server is running in HA mode.
Any more comments on this? Will investigate further during the next days.
@fibbs I was able to upgrade from Zabbix 6.5 to 7 with the following steps.
- Edit values to scale down Zabbix Server to replicaCount: 0 and deploy using zabbix-community/zabbix
- Clone helm chart source and comment out ZBX_AUTOHANODENAME config (name and value) in https://github.com/zabbix-community/helm-zabbix/blob/master/charts/zabbix/templates/deployment-zabbix-server.yaml#L142
- Deploy from this local clone with replicaCount: 1
- Follow the container logs until the DB upgrade is complete.
- Login and test
Now, when scaling the server back to the original replicaCount value of 3, I get the following error:
Error: UPGRADE FAILED: error validating "": error validating data: ValidationError(Job.spec): unknown field "metadata" in io.k8s.api.batch.v1.JobSpec
The same error occurs when deploying from zabbix-community/zabbix or from the local clone.
Looking into it, I see an if statement that affects how things are deployed depending on the replicaCount, so I will look to understand this more: https://github.com/zabbix-community/helm-zabbix/blob/master/charts/zabbix/templates/job-init-db-schema.yaml#L1
Once found, I am guessing a PR with the same if statement could be applied to disable HA automatically for replicaCount: 1, so that ZBX_AUTOHANODENAME is not applied.
When the server is started in single mode, it automatically upgrades the DB itself, so I am questioning the need for the job at all, if Zabbix has changed its behaviour as mentioned in a comment above.
By adding a false condition to the top of the job template, as well as only applying ZBX_AUTOHANODENAME if replicaCount is greater than 1, I was able to use the chart to upgrade from 6.5 to 7.
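For illustration, the conditional guarding the env entry in deployment-zabbix-server.yaml boils down to something like this (a sketch of the approach, not the literal content of the PR; the value "hostname" is what the chart already sets for this variable):
          # ... other env entries stay as they are ...
          {{- if gt (.Values.zabbixServer.replicaCount | int) 1 }}
          - name: ZBX_AUTOHANODENAME
            value: "hostname"
          {{- end }}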
I have made PR #102 in case it helps someone else, but I cannot comment on the validity of removing the Job entirely beyond "it worked for me".
Thanks @crowleym, indeed that's exactly the way I did upgrades, but it is a bit "hacky" and shouldn't be this way, which is why I am working on a proper solution. I don't want this Helm chart to run the Zabbix server in "single mode", even when only having one replica. We defined this back when DB upgrades also worked in HA mode, because we wanted to be able to scale up and down at any time.
I have an almost-working solution here in my lab, with one or two challenges left to solve. One of them is to start a zabbix_server process that only upgrades the database schema and then stops, which I will try to achieve with a hacky "start the process in the background and loop over its STDOUT" kind of construct. Briefly, the solution will work as follows:
- the Zabbix server runs in HA mode, even when only having one replica; I don't want to change this
- on any Helm installation or upgrade, all available Zabbix server pods will start and be held back by an init container, waiting not only for the database to be available but also for it to have the correct version
- an additional job is started, based on the zabbix_server container image, which does the magic of preparing the database and also upgrades the schema in case a major release upgrade happened
This is almost exactly the same as it is designed to work right now, with the following changes:
- this "after install/upgrade job" (indeed, I will probably change this to be a post-upgrade / post-install hook in the helm chart) will get one additional task: the upgrade of the db schema which is being performed by zabbix_server
- the init container coming up with any zabbix server pod will not only wait for availability of the db, but also for the right version
That should work fine then and without manual intervention.
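To make the init container part more concrete, here is a rough sketch. The image, the service and secret names and the expected-version value are assumptions for illustration, not what the chart currently ships; the dbversion table and its mandatory column are what the Zabbix server itself maintains:
initContainers:
  - name: wait-for-db-schema
    image: postgres:16-alpine              # assumption: any image that provides psql
    env:
      - name: PGHOST
        value: "zabbix-postgresql"         # hypothetical DB service name
      - name: PGUSER
        value: "zabbix"
      - name: PGDATABASE
        value: "zabbix"
      - name: PGPASSWORD
        valueFrom:
          secretKeyRef:
            name: zabbix-db-credentials    # hypothetical secret
            key: password
      - name: EXPECTED_MANDATORY
        value: "07000000"                  # mandatory DB version required by the new server release
    command:
      - /bin/sh
      - -c
      - |
        # wait until the database is reachable AND dbversion.mandatory matches
        # the version the new Zabbix server binaries require
        until [ "$(psql -tA -c 'SELECT mandatory FROM dbversion;' 2>/dev/null)" = "$EXPECTED_MANDATORY" ]; do
          echo "database not reachable or schema not upgraded yet, waiting..."
          sleep 5
        done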
Of course, it would be awesome if Zabbix themselves would implement a zabbix_server --only-upgrade-db switch or something, so that this Job container could be less hacky. I will probably try to get into a discussion with the "right people" and try to convince them to make our lives easier.
Stay tuned, an upgrade will come.
Hi @fibbs, I wonder if you managed to raise the issue with Zabbix SIA - is there a support ticket we could upvote?
I am facing the same problem here. Although I don't use this Helm project (I have my own methodology with different specs), I got stuck on the same problem.
I went looking to see whether someone had found a solution, and I found this ticket here and something related on the Zabbix forums, but to no avail.
I came up with some ideas, but by far the best solution would be an --only-upgrade-db kind of switch provided by them.
As I compile my own binaries and build my own images, I was thinking that I could snoop around in the source code, catch the latest DBPATCH_VERSION(integer) and flag it in the entrypoint, check with the database whether it requires an update, make the necessary changes, stop zabbix_server when that is finished, add back the HA configuration, restart the pod...
But this is so ugly that I am not really happy pursuing it; maybe I would patch the Zabbix source code myself to build the image if I see that Zabbix SIA will take a long time to release a solution for it.
In the end I would also find it beneficial to add a new status on the HA node to inform the other nodes that the database is being upgraded: they would just back off until the node executing the upgrade marks it as finished and/or assumes the active role. That would avoid the need for an "only-upgrade-db" switch, but it would require a larger patch.
If it's possible, I would be glad to know the status of the conversation with Zabbix SIA and about the ticket... and then I will also decide whether to go forward writing/using a patched zabbix_server binary.
Hey, I wanted to check in again on this topic... I haven't had much time for this during the last few months, sorry.
I have indeed filed a ZBXNEXT at Zabbix on this topic: https://support.zabbix.com/browse/ZBXNEXT-9453?filter=-2, but unfortunately there has not been much movement on it yet. Also, I didn't yet have the opportunity to grab one of the developers responsible for this for a beer, but I will keep trying.
Meanwhile, I did one successful test: I started a simple Docker container with PostgreSQL and another with Zabbix Server 6.4, let it all start up nicely, then stopped the Zabbix server container and started a 7.0 server container with entrypoint /bin/bash and an interactive shell.
Once in the container (and after manually creating a working zabbix_server.conf file), I created this little script called "killer.sh":
#!/bin/bash
# Create a named pipe
PIPE="/tmp/zabbix_output_pipe"
mkfifo "$PIPE"
# Start the Zabbix server, redirecting its output to the named pipe
/usr/sbin/zabbix_server --foreground -c /etc/zabbix/zabbix_server.conf > "$PIPE" 2>&1 &
# Get the process ID of the background job
ZABBIX_PID=$!
# Monitor the output from the named pipe; the pipe is opened once for the whole
# loop so the server never writes into it without a reader attached
while IFS= read -r line; do
    echo "$line" # Print each line for visibility
    # Check if the line contains the string "starting HA manager"
    if [[ "$line" == *"starting HA manager"* ]]; then
        echo "Found 'starting HA manager' - killing process"
        kill "$ZABBIX_PID"
        break
    fi
done < "$PIPE"
# Clean up by removing the named pipe
rm "$PIPE"
...and it works as expected: it starts the Zabbix server, lets it do "its thing", and kills it as soon as it's done with the "database stuff":
4ff5a96179fe:/var/lib/zabbix$ /tmp/killer.sh
Starting Zabbix Server. Zabbix 7.0.5 (revision 9406e67).
Press Ctrl+C to exit.
11:20241112:200356.584 current database version (mandatory/optional): 06040000/06040036
11:20241112:200356.584 required mandatory version: 07000000
11:20241112:200356.584 mandatory patches were found
11:20241112:200356.585 starting automatic database upgrade
11:20241112:200356.589 completed 0% of database upgrade
11:20241112:200356.592 completed 1% of database upgrade
11:20241112:200356.594 completed 2% of database upgrade
11:20241112:200356.599 completed 3% of database upgrade
11:20241112:200356.602 completed 4% of database upgrade
11:20241112:200356.608 completed 5% of database upgrade
11:20241112:200356.611 completed 6% of database upgrade
11:20241112:200356.613 completed 7% of database upgrade
11:20241112:200356.615 completed 8% of database upgrade
11:20241112:200356.618 completed 9% of database upgrade
11:20241112:200356.622 completed 10% of database upgrade
...
11:20241112:200357.184 completed 99% of database upgrade
11:20241112:200357.186 completed 100% of database upgrade
11:20241112:200357.194 database upgrade fully completed
12:20241112:200357.196 starting HA manager
Found 'starting HA manager' - killing process
So, with this we have a theoretical and elegant solution that keeps HA mode enabled in the Helm chart config while handling the upgrade of a major release correctly. Let's assume we have the chart deployment running with two pods in HA mode and we issue a helm upgrade: then a pre-upgrade hook job (started by Helm after rendering the manifests but BEFORE updating any resources inside Kubernetes) should run, doing the following (a rough manifest sketch follows the list):
- check database availability, just as now, waiting until it gets a connection
- checking database's
dbversion
table againstzabbix_server --version
output and finding out whether a major release upgrade is going to happen - if so: scale the zabbix_server deployment down to 0 replicas - for this a serviceaccount with proper permissions would be needed
- check in the database that no "active member" is connected to the Zabbix database
- prepare zabbix_server.conf based on env variables, just as zabbix_server image's
docker-entrypoint.sh
does - start the zabbix_server binary with the trick above, to just let it go through until the database schema is upgraded
- end the job gracefully
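A very rough sketch of how such a pre-upgrade hook Job could be declared follows. All names, the image and the script path are hypothetical; the actual chart implementation may look quite different:
apiVersion: batch/v1
kind: Job
metadata:
  name: zabbix-db-upgrade                  # hypothetical name
  annotations:
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-weight": "0"
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      # needs permission to scale the zabbix-server deployment down to 0 replicas
      serviceAccountName: zabbix-db-upgrade
      containers:
        - name: db-upgrade
          # assumption: a custom image based on the Zabbix server image, additionally
          # containing kubectl and the helper script
          image: example/zabbix-server-pgsql-upgrader:7.0
          command: ["/bin/bash", "/scripts/upgrade-db.sh"]    # hypothetical helper script
          envFrom:
            - secretRef:
                name: zabbix-db-credentials    # hypothetical secret with the DB connection data
The helper script would then implement exactly the bullet points above: wait for the database, compare dbversion against the target version, scale the server deployment down with kubectl if a major upgrade is pending, render zabbix_server.conf from the environment, run the "start and kill after 'starting HA manager'" trick, and exit.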
As the job will be started as a pre-upgrade hook, the deployment will then be upgraded with the correct Zabbix server version and replica count.
The pre-upgrade job will require a custom image, which could be based on the zabbix_server image but would have to contain a kubectl binary and probably some helper scripts. I will start developing that in my test environment. If anybody wants to participate, we will find a way.
A short update here: I have it running, and everything seems to work as expected. A merge request will follow as soon as I have a solution implemented for #98. Please stay tuned!