helm-charts
helm-charts copied to clipboard
[tempo-distributed] add /var/tempo emptyDir mount for querier, queryfrontend and compactor
When running with hardened securitycontext, querier, queryfrontend and compactor will fail to start with following error:
level=error ts=2022-08-07T13:47:47.444953646Z caller=main.go:109 msg="error running Tempo" err="failed to init module services error initialising module: store: failed to create store mkdir /var/tempo: permission denied"
Or read-only filesystem if you have
readOnlyRootFilesystem: true
This PR adds emptydir to that components with mounted /var/tempo, which allows them to start with non-root user.
However, there is still issue with following settings:
readOnlyRootFilesystem: true
In tempo-query-frontend pod, in tempo-query container, following error is present:
2022-08-07T14:09:06.263Z [ERROR] tempo-query: plugin init error: error="open /tmp/plugin164587256: read-only file system" timestamp=2022-08-07T14:09:06.263Z
Open to suggestions. /var/tempo is ephemeral anyway since it was writing in-pod, so switching it to emptyDir should pose no issues?
I like this improvement, but I think we need to mention the change in the readme. It may impact other deploys.
Also, as discussed in the issue linked above, how do you feel about defaulting to false for tempo-query? (would also require a note in the readme)
Should we also set default securitycontext, like it's done in mimir and loki charts? I run mine with
- tempo:
securityContext:
capabilities:
drop:
- ALL
readOnlyRootFilesystem: true
runAsNonRoot: true
runAsUser: 1000
runAsGroup: 1000
The only component with PV is ingester, and with changing user UIDs it might require manual intervention to change files ownership. However, ingester persistence is disabled by default. All other components should be fine.
I like the idea of adding the default security context. Should we do that in a different PR? or does it make sense along with these changes?
In tempo-query-frontend pod, in tempo-query container, following error is present:
2022-08-07T14:09:06.263Z [ERROR] tempo-query: plugin init error: error="open /tmp/plugin164587256: read-only file system" timestamp=2022-08-07T14:09:06.263Z
Wouldn't that be solved by adding an emptyDir mount for /tmp in that case?
Should we also set default securitycontext, like it's done in mimir and loki charts? I run mine with
- tempo: securityContext: capabilities: drop: - ALL readOnlyRootFilesystem: true runAsNonRoot: true runAsUser: 1000 runAsGroup: 1000
The only component with PV is ingester, and with changing user UIDs it might require manual intervention to change files ownership. However, ingester persistence is disabled by default. All other components should be fine.
You might also want to set the fsGroup
to avoid issues with the PV
In tempo-query-frontend pod, in tempo-query container, following error is present:
2022-08-07T14:09:06.263Z [ERROR] tempo-query: plugin init error: error="open /tmp/plugin164587256: read-only file system" timestamp=2022-08-07T14:09:06.263Z
Wouldn't that be solved by adding an emptyDir mount for /tmp in that case?
No, because grpc plugin also resides in /tmp and due to how it works: https://github.com/grafana/tempo/issues/1621 https://github.com/grafana/tempo/blob/main/cmd/tempo-query/Dockerfile Plus grafana 7 is seemingly EOL now?
I like the idea of adding the default security context. Should we do that in a different PR? or does it make sense along with these changes?
Probably makes sense adding to that PR too, since now it's compatible with full set of rules.
You might also want to set the
fsGroup
to avoid issues with the PV
Yeah, probably should do the trick in majority of cases, but i vaguely recall that not every CSI supports fsGroup, so probably still worth mentioning in changelog.
seems like i need to split podSecurityContext and containerSecurityContext
Since fsGroup can't be used at container level, i added podSecurityContext for fsGroup. But since we now have pod-wide securityContext, maybe it will be better to move other settings to top level?
Rebased. @joe-elliott can you talk a look please?
I am good on this change. @rlex I appreciate your passion for security. This is not the first PR/issue you've filed where you've pushed for best practices in Tempo.
@zanhsieh currently has a block on merging b/c he's requesting changes. They will need to lift the block. @BitProcessor did you have any additional thoughts?
Might be nice idea to kick pipelines again. I'm not sure if resolving conflicts via github UI triggers builds.
Added /var/tempo emptydir to distirbutor as well - noticed it's required for 1.5. Shouldn't hurt.
@joe-elliott i reverted adding store to distributor as discussed here https://github.com/grafana/tempo/issues/1648
Should be ready to merge now, i think.
While at it, we might as well also add:
automountServiceAccountToken: false
enableServiceLinks: false
But as @joe-elliott mentioned earlier, it's a lot of changes that have nothing to do with the initial goal of the PR: "add /var/tempo emptyDir mount for querier, queryfrontend and compactor" (They are all related changes, but wondering if we should split it into multiple PR's)
Well we can just rename it to "security hardening of tempo-distributed", maybe. Initial adding of /var/tempo emptyDir was done to make it compatible with security best practices, so, IMO, makes sense to add best practices to same PR. (especially since other charts, like loki and mimir, already comes with preconfigured securityContext and etc)
Was just discussing it with Elliot ; he said the same thing :D How about my 2 suggestions ?
Sure, why not. Should it be configurable, or static will be enough?
Sure, why not. Should it be configurable, or static will be enough?
personally I would go for static - I don't see the added value of making it configurable.
Heads up that I'm about to merge this and you will need to update your chart version. Thanks! https://github.com/grafana/helm-charts/pull/1699
Interesting, now it fails after bump to 1.5.0.
Ingester fails to assemble ring?
@joe-elliott in distributor:
level=error ts=2022-08-17T18:57:58.684074925Z caller=main.go:109 msg="error running Tempo" err="failed to init module services error initialising module: usage-report: failed to initialize usage report: mkdir /var/tempo: read-only file system"
https://github.com/grafana/helm-charts/runs/7885580755?check_suite_focus=true#step:9:284
That commit about fixing calling store for usage-report went into final 1.5.0 release?
UPD: yeah looks like it's here https://github.com/grafana/tempo/blob/v1.5.0/cmd/tempo/app/modules.go but still getting called?
Usage reporting uses the "backend" to store a seed file that all the components use to know what cluster they belong to. So in a normal distributed setup the distributor will not need access to its local disk b/c it will grab the seed file from the object storage backend.
However, this chart defaults to using local storage as the backend. So on startup the distributor is going to its "backend" to find or create the seed file. Unfortunately that is local disk.
I can add emptyDir back for distributor. Kludge for sure, but will work.
Another way will be to test it against something like minio, but that's completely different story.
I can add emptyDir back for distributor. Kludge for sure, but will work.
Yeah, I think that's the best choice for now. A minio testing setup would be a much bigger lift.
It looks like github is acting finicky now, throwing random 503s and failing pipelines
let's try close and reopen this PR to see if CI got fixed.
Yep, looks like that did the trick.
Merging time! Thanks for your contribution @rlex 🚀