mimir
Docs: Clarify why it's recommended to deploy the compactor as a StatefulSet
Is your documentation request related to a feature? If so, which one?
I'd like the documentation to clarify why we recommend deploying the compactor through a StatefulSet, even though it's stateless by nature. At the moment, the documentation only states that "the compactor is stateless".
Describe the solution that you’d like or the expected outcome
I would like our compactor documentation to clarify that we recommend deploying it through a StatefulSet (as we do at Grafana Labs), even though it's fundamentally stateless. The actual reason we (Grafana Labs) deploy it as a StatefulSet is to give each replica a dedicated disk, so that compactor I/O doesn't affect the performance of the root filesystem (pointed out by @pracucci).
@osg-grafana Moving this to backlog to follow what Mimir squad is saying is their status (backlog)
Let me know if I understand the full picture before I start documenting. (cc @pracucci)
The compactor is stateless but it does quite a lot of disk I/O. We decided to deploy it as a StatefulSet to mount a dedicated volume for each replica, so that the compactor disk I/O doesn't impact the root filesystem.
The reason we use StatefulSets with compactor is that StatefulSets are configured with a persistent volume claim template and each replica gets its own persistent volume. If we used a vanilla deployment, each pod would attempt to mount the same volume. (It wouldn't be templatized.)
And the point about the root filesystem is that we want a separate device for compactor-data
so we won't compete with the OS for the root filesystem. Is that right?
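The per-replica volume pattern described above can be sketched with a `volumeClaimTemplates` section. This is an illustrative example, not Mimir's actual manifests: the names, image, mount path, storage class, and size are all assumptions.

```yaml
# Hypothetical sketch: each compactor replica gets its own PersistentVolume
# from the claim template, instead of writing to the node's root filesystem.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: compactor
spec:
  serviceName: compactor
  replicas: 3
  selector:
    matchLabels:
      app: compactor
  template:
    metadata:
      labels:
        app: compactor
    spec:
      containers:
        - name: compactor
          image: grafana/mimir:latest   # assumption: image tag is illustrative
          args: ["-target=compactor"]
          volumeMounts:
            - name: compactor-data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: compactor-data            # becomes compactor-data-compactor-0, -1, ...
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast          # assumption: a dedicated block-storage class
        resources:
          requests:
            storage: 100Gi
```

Kubernetes derives one PVC per replica from the template (e.g. `compactor-data-compactor-0`), which is what a plain Deployment's pod template cannot express.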
The reason we use StatefulSets with compactor is that StatefulSets are configured with a persistent volume claim template and each replica gets its own persistent volume. If we used a vanilla deployment, each pod would attempt to mount the same volume. (It wouldn't be templatized.)
No, this is not correct. If we use a Deployment, each pod would get its own isolated volume, but it would be backed by the node's root filesystem. This means that heavy I/O on the root filesystem by the compactor may affect the performance of other pods running on the same node. To isolate it, we use PVs.
I think we are on the same page but there was some vagueness in what I said. You are right: an ephemeral volume by default would be a namespaced portion of the node's root FS and we don't want to create contention with the host OS or colocated pods. Therefore, we want to use a persistent volume claim to a dedicated remote storage service. Deployments allow PVCs just like StatefulSets, but doing so would have the pitfalls described here. (Let me know if we're still on different pages!)
Having said that and revisiting the k8s storage docs, I notice we could also be using 1.23's generic ephemeral volumes which would allow a pod to mount a dedicated remote block storage device with all the same characteristics of a regular ephemeral pod volume. PVCs are auto-created and auto-destroyed alongside the pods. (It seems like using GEVs would remove the constraints around running compactor as a StatefulSet. ~And would render moot the exploration around the newer StatefulSet PVC GC stuff underway: https://github.com/grafana/mimir-squad/issues/1962.~ (not quite render it moot - store-gateway still needs to be a StatefulSet.))
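For reference, the generic ephemeral volume approach mentioned above would look roughly like this on a plain Deployment. Again a sketch, not a proposal we've tested; names and sizes are assumptions:

```yaml
# Hypothetical sketch: a Deployment whose pods each get a dedicated,
# auto-provisioned PVC via a generic ephemeral volume (Kubernetes 1.23+).
# The PVC is created with the pod and deleted when the pod goes away.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: compactor
spec:
  replicas: 3
  selector:
    matchLabels:
      app: compactor
  template:
    metadata:
      labels:
        app: compactor
    spec:
      containers:
        - name: compactor
          image: grafana/mimir:latest   # assumption: illustrative image
          args: ["-target=compactor"]
          volumeMounts:
            - name: compactor-data
              mountPath: /data
      volumes:
        - name: compactor-data
          ephemeral:
            volumeClaimTemplate:
              spec:
                accessModes: ["ReadWriteOnce"]
                storageClassName: fast  # assumption: dedicated block storage
                resources:
                  requests:
                    storage: 100Gi
```

This gives the same isolation from the root filesystem as the StatefulSet PVC, with pod-scoped lifecycle instead of replica-scoped identity.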
Having said that and revisiting the k8s storage docs, I notice we could also be using 1.23's generic ephemeral volumes which would allow a pod to mount a dedicated remote block storage device with all the same characteristics of a regular ephemeral pod volume.
Looks like quite a bit of work migrating compactors from StatefulSet to Deployment, and I'm not sure it's a priority right now (I personally don't see it as a particularly good investment of time). I would suggest keeping compactors as a StatefulSet and using the StatefulSet PVC auto-deletion feature instead.
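The PVC auto-deletion feature referred to here is the StatefulSet `persistentVolumeClaimRetentionPolicy` (alpha in Kubernetes 1.23 behind the `StatefulSetAutoDeletePVC` feature gate, beta in 1.27). A minimal sketch of what enabling it might look like; the policy values are an assumption about the desired behavior:

```yaml
# Hypothetical sketch: reclaim PVCs automatically instead of leaving
# orphaned claims behind when the StatefulSet is deleted or scaled down.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: compactor
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Delete   # remove PVCs when the StatefulSet is deleted
    whenScaled: Delete    # remove PVCs of replicas removed by scale-down
  # ...rest of the StatefulSet spec unchanged...
```

Since the compactor's disk holds only scratch data for in-progress compactions, deleting the PVC on scale-down should be safe, unlike for stateful components such as the store-gateway.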