PostgreSQL slaves should handle restarts properly and not delete userdata
Currently, we have a problem where a restarting slave occasionally doesn't start up properly. The quick fix was to `rm -rf` the userdata, so a freshly started PostgreSQL process replicates it anew from the master. It seems to me we could do better (and @praiskup didn't like our current solution). I asked our storage team, and they are of the opinion that making OpenShift/Kubernetes do the EmptyDir cleanup would not make much sense, since EmptyDirs are pod-specific.
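For context, the quick fix amounts to something like the sketch below. This is illustrative rather than the exact entrypoint code: it assumes pg_basebackup is what re-seeds the replica, and `$MASTER_SERVICE_NAME`, the replication user, and the paths are assumptions about the image layout.

```bash
# Illustrative sketch of the current quick fix, not the exact entrypoint code.
# On every replica start, the old data is thrown away and pg_basebackup
# re-seeds the data directory from the master.
PGDATA=/var/lib/pgsql/data/userdata   # assumed layout

rm -rf "$PGDATA"                      # the unconditional delete this issue is about
pg_basebackup -h "$MASTER_SERVICE_NAME" -U replication -D "$PGDATA" -X stream
```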
Viable options that come to mind:
- Don't use a volume for the slave data, just a plain container directory. This might be problematic, since we might not be able to do quota space accounting later. It would be a very quick and easy solution, though, without the risk of deleting the user's data (unless they really want to).
- Be intelligent about the deletion. Include some sort of "state" file that would make guarantees about the last known state of the data. This would be harder to implement and get right, but would provide the additional benefit of reusing the data and saving some bandwidth.
cc @bparees @hhorak @mfojtik @rhcarvalho
> Don't use a volume for the slave data, just a plain container directory. This might be problematic, since we might not be able to do quota space accounting later. It would be a very quick and easy solution, though, without the risk of deleting the user's data (unless they really want to).
glad you listed this for completeness, but we definitely don't want to encourage this pattern.
what's the risk of wiping out the emptydir on each startup?
Well, @praiskup raised the possibility of accidentally mounting an incorrect volume and us wiping it without warning. I guess it's an unlikely scenario, but I agree it is at the very least bad practice. Also, it is an `rm -rf` in a bash script, which could contain any sort of bugs, so I'm not overly confident we will always do the right thing.
The least we could do would be to add some sort of `.this_is_a_replica_volume` marker file and check for its existence before executing the `rm`.
Having a good, intelligent way to delete stuff would be best in the long run, IMHO.
> The least we could do would be to add some sort of `.this_is_a_replica_volume` marker file and check for its existence before executing the `rm`.
sure that seems reasonable. only wipe it out if we created it.
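As a rough sketch of "only wipe it out if we created it": the marker name comes from the comments above, everything else is illustrative, and the marker placement gets refined further down the thread.

```bash
# Rough sketch of the marker-file guard; the marker name comes from the
# comments above, the rest is illustrative.
PGDATA=/var/lib/pgsql/data
MARKER="$PGDATA/.this_is_a_replica_volume"

if [ -e "$MARKER" ]; then
    # We created this data on a previous start, so wiping it is safe.
    rm -rf "${PGDATA:?}/"*
else
    # Unknown contents: refuse to delete rather than risk user data.
    echo "no $MARKER found, not wiping $PGDATA" >&2
fi
```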
What about introducing a variable `CLEAN_DATADIR` that would be set in Kubernetes, so the script would only delete the directory if this variable is set?
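Hypothetically, that could look like the following; only the variable name comes from the proposal above, the surrounding script is a sketch.

```bash
# Hypothetical CLEAN_DATADIR gate; the variable name is from the proposal
# above, the rest is illustrative.
if [ -n "${CLEAN_DATADIR:-}" ]; then
    rm -rf "${PGDATA:?}"
fi
```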
I don't love the idea of another environment variable, to be honest. And the solution doesn't feel right: if you forget to set the variable, you will get weird errors, and it requires documentation. If we have a simple dot file in the volume, we can be almost certain it's ok to delete it.
+1. an env variable isn't a terrible idea, but just making the default image behavior "correct" seems like a better experience.
Who is responsible for creating the `.this_is_a_replica_volume` file? Note that if you have any file within `/var/lib/pgsql/data`, initdb refuses to continue.
The same entrypoint script that now deletes the data. If that's the case, we'll just need to use `/var/lib/pgsql/data/userdata` and put the file one level above that.
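Putting the pieces together, the adjusted layout could look roughly like this; a sketch under the assumptions above, with `$MASTER_SERVICE_NAME`, the replication user, and the pg_basebackup flags being illustrative.

```bash
# Sketch of the adjusted layout: the marker lives one level above $PGDATA,
# so initdb only ever sees the userdata subdirectory.
VOLUME=/var/lib/pgsql/data                    # the mounted volume
PGDATA="$VOLUME/userdata"                     # what initdb/pg_basebackup use
MARKER="$VOLUME/.this_is_a_replica_volume"

if [ -e "$MARKER" ]; then
    rm -rf "$PGDATA"                          # only delete data we created
fi
pg_basebackup -h "$MASTER_SERVICE_NAME" -U replication -D "$PGDATA" -X stream
touch "$MARKER"                               # mark the volume for the next restart
```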
Sounds fine. Possibly a chance to do something with https://github.com/openshift/postgresql/blob/master/9.4/root/usr/share/container-scripts/postgresql/common.sh#L165?
Can you elaborate?
See the 'TODO' note in the cited code. If you plan to put something into `/var/lib/pgsql/data` (the `.this_is_a_replica_volume` marker), then it does not make sense to expect that `/var/lib/pgsql/data` will ever be used as `$PGDATA` for initdb. Thus, you should make sure `/var/lib/pgsql/data/SUBDIR` is always used.
Yes, that's correct.