awx-operator
Changing the storage size for Postgres causes a playbook error that results in a loop
Please confirm the following
- [X] I agree to follow this project's code of conduct.
- [X] I have checked the current issues for duplicates.
- [X] I understand that the AWX Operator is open source software provided for free and that I might not receive a timely response.
Bug Summary
Changing the value of postgres_storage_requirements causes an error, because the storage request on a StatefulSet cannot be changed after creation. The task "Create Database if no database is specified" in database_configuration.yml fails, which drops execution into the rescue block, and that scales everything down to 0.
The next task, "Remove PostgreSQL statefulset for upgrade" (which in this case should run), then fails while evaluating its when statement because create_statefulset_result.error does not exist. In this case, though, removing the StatefulSet is exactly what is required.
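The failure mode described above can be illustrated outside of Ansible. The snippet below is only a Python analogy, not the operator's code: the dict stands in for the registered create_statefulset_result, and its shape is an assumption. It shows why comparing a missing attribute fails outright, and how a guarded lookup avoids that. In Ansible, the comparable guard would be the `is defined` test or the `default` filter on the conditional. Note that such a guard only prevents the evaluation error; it does not by itself decide that the StatefulSet should be removed.

```python
def should_remove_statefulset(create_result: dict) -> bool:
    """Return True only when the create attempt reported a 422 error.

    `create_result` is a hypothetical stand-in for the registered
    Ansible result; the real structure may differ.
    """
    # dict.get() yields None when the key is absent, so the comparison
    # is simply False instead of raising an exception.
    return create_result.get("error") == 422


# A result without an "error" key, as in the reported logs.
result_without_error = {"changed": False}

try:
    # Unguarded access mirrors the failing conditional.
    result_without_error["error"] == 422
except KeyError as exc:
    print(f"unguarded check failed: missing key {exc}")

print(should_remove_statefulset(result_without_error))
print(should_remove_statefulset({"error": 422}))
```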
AWX Operator version
2.7.2
AWX version
23.4.0
Kubernetes platform
kubernetes
Kubernetes/Platform version
1.26.7
Modifications
no
Steps to reproduce
Have a functioning AWX environment using a managed Postgres pod.
Change the kustomization for the AWX environment to change the value of postgres_storage_requirements. This can be done either by adding it where it wasn't previously used and setting the values to something other than the default, or by increasing the current allocation.
Expected results
The statefulset should be deleted and recreated with the new PVC size as defined.
Actual results
The playbook fails, which scales all pods in the AWX environment to 0, and the operator then gets stuck in a loop attempting to update the StatefulSet.
Additional information
Once you are stuck in this state, you can manually delete the StatefulSet. The operator then sees that the StatefulSet is missing and re-creates it. After that, the deployment continues and the environment is brought back up.
Operator Logs
The conditional check 'create_statefulset_result.error == 422' failed. The error was: error while evaluating conditional (create_statefulset_result.error == 422): 'dict object' has no attribute 'error'. 'dict object' has no attribute 'error'.
The error appears to be in '/opt/ansible/roles/installer/tasks/database_configuration.yml': line 175, column 7, but may be elsewhere in the file depending on the exact syntax problem.
@tylergmuir I am confused about your description above. If we changed the operator so that it deleted the PostgreSQL StatefulSet when the storage size was changed, the PVC that the data is stored in would not be deleted. So when the new StatefulSet was created, it would not enter the running state, because the existing PVC would have the same name as the one the new StatefulSet would be dynamically trying to create. So I think the StatefulSet would try to use the existing PVC, and would try to change resources.requests.storage on the PVC, which is only allowed if the specified StorageClass supports it and has allowVolumeExpansion: true set, if I recall correctly.
The problem is that not all users will have StorageClasses that support dynamic expansion.
So, if I am following correctly, we could potentially add logic to support PVC expansion for the db pvc by doing the following:
- Add a task that compares the existing statefulset and the postgres_storage_requirements value on the spec, and if they are different, delete the StatefulSet and re-create it
- Add error handling here so that if a user specifies a new storage size and their StorageClass does not support it, we set an error status on the AWX CR, or make it a noop and intentionally exclude the storage request value change.
- We could potentially key off of the presence and value of the storageclass.allowVolumeExpansion field; but we would also need to know the default storageclass provided by the cluster, or at least the storageclass used to create the existing PVC, because storage_class on the AWX spec is not a required field.
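The checks proposed in the list above can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not operator code: parse_quantity and plan_storage_change are hypothetical names, the returned action strings are made up for the example, the quantity parser handles only the common binary and decimal suffixes, and allow_volume_expansion is assumed to have already been looked up from the StorageClass that backs the existing PVC.

```python
from decimal import Decimal
from typing import Optional

# Common Kubernetes quantity suffixes mapped to byte multipliers.
# Simplified: fractional suffixes such as "m" are not handled.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_quantity(quantity: str) -> Decimal:
    """Convert a storage quantity such as '8Gi' into a number of bytes."""
    quantity = quantity.strip()
    for suffix, multiplier in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return Decimal(quantity[: -len(suffix)]) * multiplier
    return Decimal(quantity)  # plain number of bytes


def plan_storage_change(
    current: str,
    requested: str,
    allow_volume_expansion: Optional[bool],
) -> str:
    """Decide what to do when the requested storage size differs.

    allow_volume_expansion is None when the StorageClass does not set
    the field, which is treated here the same as False.
    """
    current_bytes = parse_quantity(current)
    requested_bytes = parse_quantity(requested)

    if requested_bytes == current_bytes:
        return "noop"
    if requested_bytes < current_bytes:
        # A PVC's storage request cannot be reduced.
        return "error: shrinking a PVC is not supported"
    if not allow_volume_expansion:
        # Stop before anything is scaled down.
        return "error: StorageClass does not allow volume expansion"
    return "recreate-statefulset"


print(plan_storage_change("8Gi", "16Gi", True))   # recreate-statefulset
print(plan_storage_change("8Gi", "8192Mi", True))  # noop
```

Running the expansion check before anything is scaled down matches the concern raised in this thread: a size change that can never be applied should surface as an error on the AWX CR rather than take the service down.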
What do you think @tylergmuir ? Can you think of any other considerations? Does what I said above make sense/align with what you've seen experimentally?
Also, if you or anyone else has a good idea of how this could work and has time, a PR would be welcome.
@rooftopcellist I believe you have it all right. In my case, I had a PVC that used a storage class that did support being expanded. So all I had to do to get back to a working state was delete the StatefulSet and the rest of the existing code in the Operator handled resizing the PVC, creating the StatefulSet using that resized PVC, and building the pods on top of that.
The main issue that I ran into was that when I changed postgres_storage_requirements, the operator consumed the change and brought down the Postgres pod, but then got stuck in a loop of trying and failing to update the StatefulSet (because the storage in a StatefulSet is immutable).
But like you mentioned, in the case where the user has a storage class that isn't resizable, we would need some way of cleanly stopping the process before it starts, so that the service isn't taken down to wait for a resize of the PVC that will never happen.
Great description @tylergmuir, I had the same issue of my postgres statefulset not spinning up and the same symptoms with conditional (create_statefulset_result.error == 422): 'dict object' has no attribute 'error'. 'dict object' has no attribute 'error'.
I was worried about deleting the statefulset and losing the PVC, but since there's no explicit retention policy defined, the PVC remained up after deleting my statefulset, and then a new statefulset spun up. Thank you!