docker-mailserver-helm
Multiple replicas and a PVC RWO can't work
I was reviewing the code of the chart. I found that the default values have 2 replicas, but a PVC with ReadWriteOnce, which can't work.
The PVC using the default values will be bound to the first pod that starts, but the second pod will fail to start, because it won't be able to bind the volume.
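To illustrate, a `values.yaml` combination like the following cannot work: the first pod binds the volume, and the scheduler can never place the second replica (key names here are illustrative, not necessarily the chart's actual ones):

```yaml
# Hypothetical excerpt of the chart's default values (key names are
# illustrative; check the chart's actual values.yaml).
replicaCount: 2            # two pods requested...
persistence:
  accessModes:
    - ReadWriteOnce        # ...but the volume attaches to a single node only
```

Until the access-mode question is settled, setting `replicaCount: 1` avoids the stuck second pod.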
The question is whether it is safe to run multiple pods that share a RWX PVC.
That's a good point. And I don't know the answer. Maybe @polarathene or @georglauterbach would know. Agreed replica count should be 1 by default.
> The question is whether it is safe to run multiple pods that share a RWX PVC.
It seems not to be safe. So the whole setup should be a single StatefulSet instead of a Deployment. Actually, because it can safely work only with a single replica, it does not matter whether it is a StatefulSet or a Deployment.
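As a sketch of the StatefulSet direction (image name and mount path are assumptions, not the chart's actual values): with `volumeClaimTemplates`, each replica gets its own RWO PVC, which is the storage layout Dovecot replication expects, instead of all replicas contending for one claim:

```yaml
# Hypothetical StatefulSet sketch; names, image, and paths are placeholders.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mailserver
spec:
  serviceName: mailserver
  replicas: 1                     # safe default until replication is supported
  selector:
    matchLabels:
      app: mailserver
  template:
    metadata:
      labels:
        app: mailserver
    spec:
      containers:
        - name: mailserver
          image: mailserver/docker-mailserver   # assumed image name
          volumeMounts:
            - name: mail-storage
              mountPath: /var/mail              # assumed mail data path
  volumeClaimTemplates:
    - metadata:
        name: mail-storage
      spec:
        accessModes: ["ReadWriteOnce"]          # one PVC per replica
        resources:
          requests:
            storage: 10Gi
```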
https://doc.dovecot.org/configuration_manual/replication/
The approach described above doesn't seem to be the best, though, and there is another approach of running multiple Dovecot backends somehow:
> Replication works only between server pairs. If you have a large cluster, you need multiple independently functioning Dovecot backend pairs.
Maybe we can implement it in DMS somehow?
Right, if you wanted to use Dovecot replication, I agree a StatefulSet would be the right approach.
An alternative is you could use Dovecot Director:
https://doc.dovecot.org/admin_manual/director/dovecotdirector/
I would hope that in that case a ReplicaSet would work, because the same user is always directed to the same Dovecot instance. Each Dovecot instance would then be writing to a different Maildir, even on the same PV (so ReadWriteMany).
Back to your original question: a ReadWriteOnce volume can be shared between multiple pods running on the same node. So is that safe?
You are right, Dovecot Director seems to be the right approach. Maybe we should open a feature request in the main repo to discuss its implementation.
The issue with RWO is that all pods must be on the same node, and as documented by Dovecot this approach is not safe anyway. Having all pods on the same node might make some sense, but it is usually not the intended goal: with a mail server you want to distribute the instances across different nodes to stay online as much as possible, and running all pods on a single node does not serve that.
> The ReadWriteOnce access mode restricts volume access to a single node, which means it is possible for multiple pods on the same node to read from and write to the same volume. This could potentially be a major problem for some applications, especially if they require at most one writer for data safety guarantees.
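For completeness, co-locating all pods on the node that holds the RWO volume would require an affinity rule roughly like the following (the label is hypothetical), but as discussed this sacrifices node-level availability and is still unsafe for Dovecot's on-disk state:

```yaml
# Hypothetical pod-spec fragment: schedule every mailserver pod onto
# the same node so a ReadWriteOnce volume can be shared between them.
affinity:
  podAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: mailserver            # assumed pod label
        topologyKey: kubernetes.io/hostname
```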
> Maybe @polarathene
I don't have k8s expertise to chime in much here.
The replication part has been requested on the main DMS repo, but we don't have any proper support for that in place AFAIK. There are also gotchas involved with that depending on how volumes are managed, especially if NFS is involved.
> Maybe we should open a feature request in the main repo to discuss its implementation.
https://github.com/docker-mailserver/docker-mailserver/issues/2048
AFAIK the contributor is waiting on an LDAP support improvement; I rejected their approach in favor of mine, but that's been blocked until I've finished the new LDAP docs to support the refactor 😓
> The replication part has been requested on the main DMS repo, but we don't have any proper support for that in place AFAIK.
NFS would be an issue if you run a Deployment with multiple replicas accessing a PVC (PersistentVolumeClaim) in RWX mode. In that case NFS is what allows the read-write-many access, which might end up with mail duplication, as stated by the official Dovecot docs.
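The RWX-over-NFS setup being discussed would look roughly like this (server address and export path are placeholders):

```yaml
# Hypothetical NFS-backed PV/PVC pair allowing ReadWriteMany access.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mail-nfs
spec:
  capacity:
    storage: 50Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.0.0.10        # placeholder NFS server address
    path: /exports/mail      # placeholder export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mail-nfs
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  volumeName: mail-nfs
```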
> Warning: Shared folder replication doesn't work correctly. Mainly it can generate a lot of duplicate emails. This is because there's currently a per-user lock that prevents multiple dsyncs from working simultaneously on the same user. But with shared folders multiple users can be syncing the same folder. So this would need additional locks (e.g. shared folders would likely need to lock the owner user, and public folders would likely need a per-folder lock or maybe a global public folder lock). There are no plans to fix this.
That is why the replication approach is needed at all, rather than a direct share between pods.
I will check with the contributor in that issue. I may help with it if any of you would like to implement it with me. Implementing replication would be nice, but the Director approach would be even nicer, since it is the one that really scales. So I would rather spend time on implementing the nicer approach.
> AFAIK the contributor is waiting on an LDAP support improvement; I rejected their approach in favor of mine, but that's been blocked until I've finished the new LDAP docs to support the refactor 😓
I understand; yes, it makes sense to revisit this after you push the docs reflecting the LDAP changes.
Let us see what will come out of the discussion there.
I have been thinking about it, and it might not be that hard to implement, if I am not missing anything. My idea is that we can use the ingress to always direct the same user to the same pod; this approach is also used with other collaboration tools. That way we can guarantee that no change is written or synced twice, because a user will always talk to the same Dovecot instance.
That way, we can safely use the PVC in RWX mode (with NFS).
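One caveat: IMAP/SMTP are plain TCP, so a standard HTTP ingress cannot route by mail user. The closest built-in Kubernetes approximation is client-IP session affinity on the Service, which pins a client (not a user) to one pod. A hedged sketch (names and ports are illustrative):

```yaml
# Hypothetical Service with client-IP stickiness for IMAPS traffic.
apiVersion: v1
kind: Service
metadata:
  name: imap
spec:
  selector:
    app: mailserver              # assumed pod label
  ports:
    - name: imaps
      port: 993
      targetPort: 993
  sessionAffinity: ClientIP      # pin each client IP to one backend pod
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800      # 3h stickiness window
```

This only approximates Director's per-user routing: a user connecting from two IPs could still hit two different backends.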
I am not sure if it is that simple, but that is what came to my mind.
I haven't read the Dovecot documentation on this (maybe they do nothing more than this), but it is at least an option I know is used to solve similar issues in similar use cases.