TEP-0086: Larger Results via Sidecar Logs
We propose providing multiple solutions, all guarded behind a larger-results
feature flag, so that we can experiment and figure out a way forward. These gated solutions will be alpha, and can be changed or removed at any time.
In this change, we propose experimenting with Sidecar Logs as a solution for providing larger Results within the CRDs. This will be enabled by setting larger-results: "sidecar-logs".
This solution can be changed at any time, or removed completely. This will give us an opportunity to gather user feedback and find ways to address the concerns, as we figure out a way forward.
Many thanks to @chitrangpatel for implementing the proof of concept - demo.
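For a concrete picture of the proposed gate, here is a minimal Go sketch of how a reconciler could consult it; the flag key and value are the ones proposed above, while the helper name and the idea of checking them against the feature-flags ConfigMap data are illustrative assumptions, not the actual implementation.

```go
package config

// Proposed (alpha) gate from this TEP; both the key and the value may change
// or be removed entirely while we experiment.
const (
	largerResultsKey = "larger-results"
	sidecarLogsValue = "sidecar-logs"
)

// UseSidecarLogs is a hypothetical helper: given the data of the feature-flags
// ConfigMap, it reports whether the sidecar-logs experiment is switched on.
func UseSidecarLogs(featureFlags map[string]string) bool {
	return featureFlags[largerResultsKey] == sidecarLogsValue
}
```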
/kind tep
/assign @pritidesai
/cc @wlynch
/assign
@tlawrie @imjasonh @tomcli @ScrapCodes - thought this may be of interest to you, please take a look
/assign @dibyom
Hi @jerop I think it's fantastic that a PoC was achieved.
we propose moving forward with Sidecar Logs as the solution for providing larger Results within the CRDs. IMO, there are a number of concerns with the sidecar log tailing approach that I think would put pause on proposing this as the final solution, considering the alternatives listed in the TEP.
I suggest either waiting until the other PoC is finished, or alternatively continuing with a new PoC to determine whether one of the other alternatives may be better suited to solving this challenge.
> Hi @jerop I think it's fantastic that a PoC was achieved.
> we propose moving forward with Sidecar Logs as the solution for providing larger Results within the CRDs. IMO, there are a number of concerns with the sidecar log tailing approach that I think would put pause on proposing this as the final solution, considering the alternatives listed in the TEP.
> I suggest either waiting until the other PoC is finished, or alternatively continuing with a new PoC to determine whether one of the other alternatives may be better suited to solving this challenge.
@tlawrie didn't mean this as the only solution, intended this as a solution that we provide behind a feature flag so that we can experiment with it and find ways to address the concerns - and that we can do the same with the PoCs when they are ready - so that we can have tangible results to help us move this work forward, what do you think?
clarified the language to reflect that this is an experiment, and we can provide the other solutions behind feature gates as well - and that they can be changed or removed at any time
> didn't mean this as the only solution, intended this as a solution that we provide behind a feature flag so that we can experiment with it and find ways to address the concerns - and that we can do the same with the PoCs when they are ready - so that we can have tangible results to help us move this work forward, what do you think?
If we're not yet sure which solution we are going to land on as the primary answer long term, perhaps we should just leave this under alternatives? By moving this idea out of alternatives and into a top level proposal section, my reaction was that we were selecting this approach over the other alternatives.
I don't think any of this blocks experimentation, and if we didn't want to make changes to the upstream controller until we're settled on a design I think this could be implemented without any changes to the Pipelines controller if we really wanted.
> If we're not yet sure which solution we are going to land on as the primary answer long term, perhaps we should just leave this under alternatives?
@wlynch happy to move it to alternatives - what I care the most about is if we can make them implementable while gated behind feature flags so that we can experiment - what do you think?
@wlynch @tlawrie @pritidesai @dibyom - moved the solution to the alternatives section and left the proposal to experiment behind feature gates - please take a look
> @wlynch happy to move it to alternatives - what I care the most about is if we can make them implementable while gated behind feature flags so that we can experiment - what do you think?
I personally don't mind experimenting even without the implementable bit so long as the feature doesn't require significant rearchitecting. If we want to be more conservative, pretty sure we could also implement this with a mutating Pod admission webhook and just inject the sidecar / entrypoint without making changes to the Pipelines controller (though this won't work once SPIRE support lands). 🤷 The other alternative is we just experiment in a fork until we have more confidence.
> The other alternative is we just experiment in a fork until we have more confidence.
@wlynch it is hard to experiment and gather feedback from forks where visibility is limited. this is why we experimented with the alternatives for remote resolution (TEP-0060) behind feature flags to figure out a way forward before we chose the solution - https://github.com/tektoncd/pipeline/pull/4168. the same experimentation approach i'm suggesting here. i believe this kind of experimentation to get tangible feedback from users and dogfooding/fishfooding is what will give us more confidence.
> If we want to be more conservative, pretty sure we could also implement this with a mutating Pod admission webhook and just inject the sidecar / entrypoint without making changes to the Pipelines controller (though this won't work once SPIRE support lands). 🤷
I think we still need the controller changes for parsing the logs.
> I don't think any of this blocks experimentation, and if we didn't want to make changes to the upstream controller until we're settled on a design I think this could be implemented without any changes to the Pipelines controller if we really wanted.
I'm assuming in this approach instead of the pipelinerun controller parsing the logs, it is done by a separate controller?
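Whichever controller ends up owning it, the log-parsing step under discussion could look roughly like the sketch below; the sidecar container name and the one-JSON-object-per-result wire format are assumptions for illustration, not the PoC's actual behaviour.

```go
package results

import (
	"context"
	"encoding/json"
	"io"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// sidecarResult is an assumed wire format: the sidecar prints one JSON object
// per result to its stdout once the steps have finished writing them.
type sidecarResult struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

// resultsFromSidecarLogs reads the logs of the (hypothetically named) results
// sidecar and decodes the result values. Whether this lives in the TaskRun
// reconciler or in a separate controller is exactly the open question above.
func resultsFromSidecarLogs(ctx context.Context, kc kubernetes.Interface, namespace, pod string) ([]sidecarResult, error) {
	req := kc.CoreV1().Pods(namespace).GetLogs(pod, &corev1.PodLogOptions{
		Container: "sidecar-log-results", // assumed container name
	})
	stream, err := req.Stream(ctx)
	if err != nil {
		return nil, err
	}
	defer stream.Close()

	var out []sidecarResult
	dec := json.NewDecoder(stream)
	for {
		var r sidecarResult
		if err := dec.Decode(&r); err == io.EOF {
			break
		} else if err != nil {
			return nil, err
		}
		out = append(out, r)
	}
	return out, nil
}
```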
> @wlynch it is hard to experiment and gather feedback from forks where visibility is limited. this is why we experimented with the alternatives for remote resolution (TEP-0060) behind feature flags to figure out a way forward before we chose the solution - https://github.com/tektoncd/pipeline/pull/4168. the same experimentation approach i'm suggesting here. i believe this kind of experimentation to get tangible feedback from users and dogfooding/fishfooding is what will give us more confidence.
+1. I agree that, while we have to be careful not to rearchitect too much for an experiment, it's worthwhile making it easy to try out the feature, gather feedback, and decide on the way forward.
In terms of marking the TEP implementable or not - personally I'm ok with experimenting without the implementable flag given that we haven't selected the way forward.
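To make the mutating-webhook alternative floated above more concrete, here is a rough, hypothetical sketch of a handler that appends a log-tailing sidecar to TaskRun pods without touching the Pipelines controller; the sidecar name, image and command are placeholders, and (as noted above) something would still have to read the results back out of its logs.

```go
package main

import (
	"encoding/json"
	"io"
	"log"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	corev1 "k8s.io/api/core/v1"
)

// handleMutate decodes the AdmissionReview and, if the pod belongs to a
// TaskRun, returns a JSON patch that appends a results sidecar.
func handleMutate(w http.ResponseWriter, r *http.Request) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	var review admissionv1.AdmissionReview
	if err := json.Unmarshal(body, &review); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	var pod corev1.Pod
	if err := json.Unmarshal(review.Request.Object.Raw, &pod); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	resp := &admissionv1.AdmissionResponse{UID: review.Request.UID, Allowed: true}
	// Label set by the Pipelines controller on TaskRun pods.
	if _, isTaskRunPod := pod.Labels["tekton.dev/taskRun"]; isTaskRunPod {
		sidecar := corev1.Container{
			Name:    "sidecar-log-results",                // placeholder name
			Image:   "example.com/results-sidecar:latest", // placeholder image
			Command: []string{"/results-sidecar", "--results-dir=/tekton/results"},
		}
		patch, _ := json.Marshal([]map[string]interface{}{
			{"op": "add", "path": "/spec/containers/-", "value": sidecar},
		})
		pt := admissionv1.PatchTypeJSONPatch
		resp.Patch = patch
		resp.PatchType = &pt
	}

	review.Response = resp
	out, _ := json.Marshal(review)
	w.Header().Set("Content-Type", "application/json")
	w.Write(out)
}

func main() {
	http.HandleFunc("/mutate", handleMutate)
	// Admission webhooks must be served over TLS; cert paths are placeholders.
	log.Fatal(http.ListenAndServeTLS(":8443", "/certs/tls.crt", "/certs/tls.key", nil))
}
```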
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by:
To complete the pull request process, please ask for approval from dibyom after the PR has been reviewed.
The full list of commands accepted by this bot can be found here.
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
@dibyom @pritidesai @vdemeester please take a look
cc @wlynch @tlawrie @imjasonh @chitrangpatel
Thanks @jerop
Have we done any measurements around the overhead of 1. adding a sidecar to each taskRun that requires a result and 2. having the reconciler parse potentially a few MBs worth of results in the controller itself?
@dibyom I was supposed to do that but I completely missed it. When you say overhead, you just mean in terms of memory and CPU footprint of the pods right? If so, I can start gathering metrics right away.
> @dibyom I was supposed to do that but I completely missed it. When you say overhead, you just mean in terms of memory and CPU footprint of the pods right? If so, I can start gathering metrics right away.
No worries :) I think pod/run startup time would also be a good thing to measure
> @dibyom I was supposed to do that but I completely missed it. When you say overhead, you just mean in terms of memory and CPU footprint of the pods right? If so, I can start gathering metrics right away.
> No worries :) I think pod/run startup time would also be a good thing to measure
@wlynch @dibyom @jerop I did the overhead tests (pipeline controller's CPU, Memory and Pod/Task startup time). For the test, I spawned 20-30 pipeline runs, each with 3 tasks (each with two steps), and gathered metrics from the metrics server during the tests. I ran it with and without sidecar logs.
Here are the results:
- Average pipeline controller's CPU difference during pipeline run: 1%
- Average pipeline controller's Memory usage difference during pipeline run: 0.2%
- Average Pod Startup time (time to get to running state) difference: 3 s per task (I think this is the most significant difference)
overhead with sidecar logs
The figure below shows measurements of CPU and Memory usage of the control plane, taken every 10 s while the controller was running and spawning pipeline runs with sidecar logs enabled.
overhead without sidecar logs
The figure below shows measurements of CPU and Memory usage of the control plane, taken every 10 s while the controller was running and spawning pipeline runs with sidecar logs disabled.
overhead pod startup times
The figure below shows measurements of startup times of ~50 pods, where startup time is the time to get to the Running state while monitoring on k9s. The black points and lines are measurements when spawning pods without the results sidecar, while red shows measurements when we do have the results sidecar being spawned alongside the steps.
Let me know if anything is unclear 🙂. I can share the scripts I made to do these measurements.
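Not the scripts mentioned above, but for reference, a sketch of one way such controller CPU/memory samples could be pulled programmatically from the metrics server; the namespace, label selector, and 10 s interval are assumptions mirroring the description of the test.

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
	metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)

func main() {
	// Client for the metrics.k8s.io API (requires metrics-server in the cluster).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	mc := metricsclient.NewForConfigOrDie(cfg)

	// Sample the controller pod every 10 s, as in the test above.
	for {
		pods, err := mc.MetricsV1beta1().PodMetricses("tekton-pipelines").List(
			context.Background(),
			metav1.ListOptions{LabelSelector: "app=tekton-pipelines-controller"}, // assumed label
		)
		if err != nil {
			panic(err)
		}
		for _, p := range pods.Items {
			for _, c := range p.Containers {
				fmt.Printf("%s %s/%s cpu=%dm mem=%dMi\n",
					time.Now().Format(time.RFC3339), p.Name, c.Name,
					c.Usage.Cpu().MilliValue(), c.Usage.Memory().Value()/(1024*1024))
			}
		}
		time.Sleep(10 * time.Second)
	}
}
```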
/assign @afrittoli
@chitrangpatel Thank you so much for the performance benchmarking numbers. The extra 3s overhead to startup time per taskrun is a bit concerning. Wondering - does the latency go up if, say, you increase the load (e.g. run 50 pipelineruns concurrently)?
From a mitigation standpoint, it might help if we only inject the sidecars if we know the task is writing a result.
> @chitrangpatel Thank you so much for the performance benchmarking numbers. The extra 3s overhead to startup time per taskrun is a bit concerning. Wondering - does the latency go up if, say, you increase the load (e.g. run 50 pipelineruns concurrently)?
Yes, I was very concerned about the 3 s overhead too. It's an increase of ~50% over the average ~6 s startup time without the results sidecar. I can check the times when I increase the load. Although I will have to figure out a way to accurately measure start-up times instead of eye-balling the k9s screen 50 times (like I did last time 😄 )
> From a mitigation standpoint, it might help if we only inject the sidecars if we know the task is writing a result.
Yes, this is already the case. The task run reconciler will check if the steps in the task are producing any results and only inject a results sidecar in that case.
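A minimal sketch of that guard, assuming the check is made against the Task's declared results (the helper name is made up):

```go
package reconciler

import "github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"

// needsResultsSidecar reports whether a results sidecar is worth injecting:
// only when the Task actually declares results for its steps to write.
func needsResultsSidecar(ts *v1beta1.TaskSpec) bool {
	return ts != nil && len(ts.Results) > 0
}
```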
> I can check the times when I increase the load. Although I will have to figure out a way to accurately measure start-up times instead of eye-balling the k9s screen 50 times (like I did last time 😄 )
Given that this is a prototype I think that's ok - we can use these numbers as a baseline to compare to
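As a possible alternative to eyeballing k9s, startup times could be read straight off the pod status; the rough sketch below uses the Ready condition as a proxy for reaching Running, and the namespace and label selector are assumptions.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	kc := kubernetes.NewForConfigOrDie(cfg)

	// List the TaskRun pods and report how long each took from creation until
	// its Ready condition flipped to True (an approximation of "time to Running").
	pods, err := kc.CoreV1().Pods("default").List(context.Background(),
		metav1.ListOptions{LabelSelector: "app.kubernetes.io/managed-by=tekton-pipelines"})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		for _, cond := range p.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				fmt.Printf("%s startup=%s\n", p.Name, cond.LastTransitionTime.Sub(p.CreationTimestamp.Time))
			}
		}
	}
}
```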
API WG - on agenda for further discussion
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale with a justification.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.
/lifecycle stale
Send feedback to tektoncd/plumbing.
opened a new TEP instead of updating TEP-0086 - https://github.com/tektoncd/community/pull/887 - so closing this PR