PostgreSQL slaves should handle restarts properly and not delete userdata
Currently, we have a problem where a restarting slave occasionally doesn't start up properly. The quick fix was to `rm -rf` the userdata, so a freshly started PostgreSQL process replicates it anew from the master. It seems to me we could do better (and @praiskup didn't like our current solution). I asked our storage team, and they are of the opinion that making OpenShift/Kubernetes do the EmptyDir cleanup would not make much sense, since EmptyDirs are pod-specific.
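For context, the quick fix amounts to something like the sketch below. This is illustrative rather than the exact entrypoint code: it assumes pg_basebackup is what re-seeds the replica, and `$MASTER_SERVICE_NAME`, the replication user, and the paths are assumptions about the image layout.

```bash
# Illustrative sketch of the current quick fix, not the exact entrypoint code.
# On every replica start, the old data is thrown away and pg_basebackup
# re-seeds the data directory from the master.
PGDATA=/var/lib/pgsql/data/userdata   # assumed layout

rm -rf "$PGDATA"                      # the unconditional delete this issue is about
pg_basebackup -h "$MASTER_SERVICE_NAME" -U replication -D "$PGDATA" -X stream
```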
Viable options that come to mind:
- Don't use a volume for the slave data, just a plain container directory. This might be problematic, since we might not be able to do quota space accounting later. It would be a very quick and easy solution, though, without the risk of deleting the user's data (unless they really want to).
- Be intelligent about the deletion. Include some sort of "state" file that would make guarantees about the last known state of the data. This would be harder to implement and get right, but would provide the additional benefit of reusing the data and saving some bandwidth.
cc @bparees @hhorak @mfojtik @rhcarvalho
> Don't use a volume for the slave data, just a plain container directory. This might be problematic, since we might not be able to do quota space accounting later. It would be a very quick and easy solution, though, without the risk of deleting the user's data (unless they really want to).
glad you listed this for completeness, but we definitely don't want to encourage this pattern.
what's the risk of wiping out the emptydir on each startup?
Well, @praiskup raised the possibility of accidentally mounting an incorrect volume and us wiping it without warning. I guess it's an unlikely scenario, but I agree it is at the very least bad practice. Also, it is an `rm -rf` in a bash script, which could contain any sort of bugs, so I'm not overly confident we will always do the right thing.
The least we could do would be to add some sort of `.this_is_a_replica_volume` marker file and check for its existence before executing the `rm`.
Having a good, intelligent way to delete stuff would be best in the long run, IMHO.
> The least we could do would be to add some sort of `.this_is_a_replica_volume` marker file and check for its existence before executing the `rm`.
sure that seems reasonable. only wipe it out if we created it.
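As a rough sketch of "only wipe it out if we created it": the marker name comes from the comments above, everything else is illustrative, and the marker placement gets refined further down the thread.

```bash
# Rough sketch of the marker-file guard; the marker name comes from the
# comments above, the rest is illustrative.
PGDATA=/var/lib/pgsql/data
MARKER="$PGDATA/.this_is_a_replica_volume"

if [ -e "$MARKER" ]; then
    # We created this data on a previous start, so wiping it is safe.
    rm -rf "${PGDATA:?}/"*
else
    # Unknown contents: refuse to delete rather than risk user data.
    echo "no $MARKER found, not wiping $PGDATA" >&2
fi
```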
What about introducing a variable `CLEAN_DATADIR` that would be set in Kubernetes, so the script would only delete the directory if this variable is set?
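Hypothetically, that could look like the following; only the variable name comes from the proposal above, the surrounding script is a sketch.

```bash
# Hypothetical CLEAN_DATADIR gate; the variable name is from the proposal
# above, the rest is illustrative.
if [ -n "${CLEAN_DATADIR:-}" ]; then
    rm -rf "${PGDATA:?}"
fi
```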
I don't love the idea of another environment variable, to be honest. And the solution doesn't feel right: if you forget to set the variable, you will get weird errors, and it requires documentation. If we have a simple dot file in the volume, we can be almost certain it's ok to delete it.
+1. an env variable isn't a terrible idea, but just making the default image behavior "correct" seems like a better experience.
Who is responsible for creating the `.this_is_a_replica_volume` file? Note that if you have any file within `/var/lib/pgsql/data`, initdb refuses to continue.
The same entrypoint script that now deletes the data. If that's the case, we'll just need to use `/var/lib/pgsql/data/userdata` and put the file one level above that.
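Putting the pieces together, the adjusted layout could look roughly like this; a sketch under the assumptions above, with `$MASTER_SERVICE_NAME`, the replication user, and the pg_basebackup flags being illustrative.

```bash
# Sketch of the adjusted layout: the marker lives one level above $PGDATA,
# so initdb only ever sees the userdata subdirectory.
VOLUME=/var/lib/pgsql/data                    # the mounted volume
PGDATA="$VOLUME/userdata"                     # what initdb/pg_basebackup use
MARKER="$VOLUME/.this_is_a_replica_volume"

if [ -e "$MARKER" ]; then
    rm -rf "$PGDATA"                          # only delete data we created
fi
pg_basebackup -h "$MASTER_SERVICE_NAME" -U replication -D "$PGDATA" -X stream
touch "$MARKER"                               # mark the volume for the next restart
```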
Sounds fine. Possibly a chance to do something with https://github.com/openshift/postgresql/blob/master/9.4/root/usr/share/container-scripts/postgresql/common.sh#L165?
Can you elaborate?
See the 'TODO' note in the cited code. If you plan to put something into `/var/lib/pgsql/data` (the `.this_is_a_replica_volume` marker), then it does not make sense to expect that `/var/lib/pgsql/data` will ever be used as `$PGDATA` for initdb. Thus, you should make sure `/var/lib/pgsql/data/SUBDIR` is always used.
Yes, that's correct.