
TEP-0086: Larger Results via Sidecar Logs

Open jerop opened this issue 2 years ago • 24 comments

We propose that we provide multiple solutions, all guarded behind a larger-results feature flag, so that we can experiment and figure out a way forward. These gated solutions will be alpha, and can be changed or removed at any time.

In this change, we propose experimenting with Sidecar Logs as a solution for providing larger Results within the CRDs. This will be enabled by setting larger-results: "sidecar-logs".
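For illustration, a minimal Go sketch of how the reconciler could gate on such a flag (the function and flag map here are hypothetical, not the actual implementation):

```go
package sidecarlogs

// useSidecarLogs sketches gating the experiment behind the proposed
// "larger-results" feature flag. The map-based lookup is illustrative only.
func useSidecarLogs(featureFlags map[string]string) bool {
	// larger-results: "sidecar-logs" opts into the sidecar-logs experiment;
	// any other value keeps the existing termination-message behaviour.
	return featureFlags["larger-results"] == "sidecar-logs"
}
```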

This solution can be changed at any time, or removed completely. This will give us an opportunity to gather user feedback and find ways to address the concerns, as we figure out a way forward.

Many thanks to @chitrangpatel for implementing the proof of concept - demo.

/kind tep

jerop avatar Jun 28 '22 16:06 jerop

/assign @pritidesai

jerop avatar Jun 28 '22 16:06 jerop

/cc @wlynch

wlynch avatar Jun 28 '22 16:06 wlynch

/assign

pritidesai avatar Jun 28 '22 16:06 pritidesai

@tlawrie @imjasonh @tomcli @ScrapCodes - thought this may be of interest to you, please take a look

jerop avatar Jun 28 '22 16:06 jerop

/assign @dibyom

jerop avatar Jun 28 '22 19:06 jerop

Hi @jerop I think it's fantastic that a PoC was achieved.

we propose moving forward with Sidecar Logs as the solution for providing larger Results within the CRDs.

IMO, there are a number of concerns with the sidecar log tailing approach that I think should give us pause before proposing this as the final solution, considering the alternatives listed in the TEP.

I suggest either waiting until the other PoC is finished, or continuing with a new PoC to determine whether one of the other alternatives may be better suited to solving this challenge.

tlawrie avatar Jun 29 '22 06:06 tlawrie

Hi @jerop I think it's fantastic that a PoC was achieved.

we propose moving forward with Sidecar Logs as the solution for providing larger Results within the CRDs.

IMO, there are a number of concerns with the sidecar log tailing approach that I think should give us pause before proposing this as the final solution, considering the alternatives listed in the TEP.

I suggest either waiting until the other PoC is finished, or continuing with a new PoC to determine whether one of the other alternatives may be better suited to solving this challenge.

@tlawrie I didn't mean this as the only solution - I intended it as a solution that we provide behind a feature flag so that we can experiment with it and find ways to address the concerns. We can do the same with the other PoCs when they are ready, so that we have tangible results to help move this work forward - what do you think?

Clarified the language to reflect that this is an experiment, that we can provide the other solutions behind feature gates as well, and that they can be changed or removed at any time.

jerop avatar Jun 29 '22 12:06 jerop

I didn't mean this as the only solution - I intended it as a solution that we provide behind a feature flag so that we can experiment with it and find ways to address the concerns. We can do the same with the other PoCs when they are ready, so that we have tangible results to help move this work forward - what do you think?

If we're not yet sure which solution we are going to land on as the primary answer long term, perhaps we should just leave this under alternatives? When this idea was moved out of alternatives and into a top-level proposal section, my reaction was that we were selecting this approach over the other alternatives.

I don't think any of this blocks experimentation, and if we didn't want to make changes to the upstream controller until we're settled on a design I think this could be implemented without any changes to the Pipelines controller if we really wanted.

wlynch avatar Jun 29 '22 13:06 wlynch

If we're not yet sure which solution we are going to land on as the primary answer long term, perhaps we should just leave this under alternatives?

@wlynch happy to move it to alternatives - what I care most about is whether we can make them implementable while gated behind feature flags so that we can experiment - what do you think?

jerop avatar Jun 29 '22 15:06 jerop

@wlynch @tlawrie @pritidesai @dibyom - moved the solution to the alternatives section and left the proposal to experiment behind feature gates - please take a look

jerop avatar Jun 29 '22 15:06 jerop

@wlynch happy to move it to alternatives - what I care most about is whether we can make them implementable while gated behind feature flags so that we can experiment - what do you think?

I personally don't mind experimenting even without the implementable bit so long as the feature doesn't require significant rearchitecting. If we want to be more conservative, pretty sure we could also implement this with a mutating Pod admission webhook and just inject the sidecar / entrypoint without making changes to the Pipelines controller (though this won't work once SPIRE support lands). 🤷 The other alternative is we just experiment in a fork until we have more confidence.
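As a rough illustration of that webhook idea - purely a sketch, with a made-up sidecar name and image, and TLS/server setup omitted:

```go
package webhook

import (
	"encoding/json"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	corev1 "k8s.io/api/core/v1"
)

// serveMutate sketches the mutating-webhook alternative: patch TaskRun pods
// to append a results sidecar without changing the Pipelines controller.
func serveMutate(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	var pod corev1.Pod
	_ = json.Unmarshal(review.Request.Object.Raw, &pod)

	// JSON patch that appends a hypothetical results sidecar to the pod.
	patch, _ := json.Marshal([]map[string]interface{}{{
		"op":    "add",
		"path":  "/spec/containers/-",
		"value": corev1.Container{Name: "sidecar-log-results", Image: "example.com/results-sidecar:latest"},
	}})
	pt := admissionv1.PatchTypeJSONPatch
	review.Response = &admissionv1.AdmissionResponse{
		UID:       review.Request.UID,
		Allowed:   true,
		Patch:     patch,
		PatchType: &pt,
	}
	_ = json.NewEncoder(w).Encode(review)
}
```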

wlynch avatar Jun 29 '22 15:06 wlynch

The other alternative is we just experiment in a fork until we have more confidence.

@wlynch it is hard to experiment and gather feedback from forks where visibility is limited. This is why we experimented with the alternatives for remote resolution (TEP-0060) behind feature flags to figure out a way forward before we chose the solution - https://github.com/tektoncd/pipeline/pull/4168. I'm suggesting the same experimentation approach here. I believe this kind of experimentation, getting tangible feedback from users and dogfooding/fishfooding, is what will give us more confidence.

jerop avatar Jun 29 '22 18:06 jerop

If we want to be more conservative, pretty sure we could also implement this with a mutating Pod admission webhook and just inject the sidecar / entrypoint without making changes to the Pipelines controller (though this won't work once SPIRE support lands). 🤷

I think we still need the controller changes for parsing the logs.

I don't think any of this blocks experimentation, and if we didn't want to make changes to the upstream controller until we're settled on a design I think this could be implemented without any changes to the Pipelines controller if we really wanted.

I'm assuming that in this approach, instead of the pipelinerun controller parsing the logs, it is done by a separate controller?
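Whichever component ends up doing it, the log parsing itself could look roughly like this sketch (the sidecar container name and the one-JSON-object-per-line format are assumptions, not the PoC's actual contract):

```go
package sidecarlogs

import (
	"bufio"
	"context"
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// result assumes the sidecar prints one JSON object per result to stdout.
type result struct {
	Name  string `json:"name"`
	Value string `json:"value"`
}

// resultsFromSidecarLogs streams the logs of an assumed "sidecar-log-results"
// container and decodes one result per line, skipping anything else.
func resultsFromSidecarLogs(ctx context.Context, c kubernetes.Interface, ns, pod string) ([]result, error) {
	req := c.CoreV1().Pods(ns).GetLogs(pod, &corev1.PodLogOptions{Container: "sidecar-log-results"})
	rc, err := req.Stream(ctx)
	if err != nil {
		return nil, err
	}
	defer rc.Close()

	var out []result
	sc := bufio.NewScanner(rc)
	for sc.Scan() {
		var res result
		if err := json.Unmarshal(sc.Bytes(), &res); err != nil {
			continue // not a result line
		}
		out = append(out, res)
	}
	return out, sc.Err()
}
```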

dibyom avatar Jun 30 '22 21:06 dibyom

@wlynch it is hard to experiment and gather feedback from forks where visibility is limited. This is why we experimented with the alternatives for remote resolution (TEP-0060) behind feature flags to figure out a way forward before we chose the solution - https://github.com/tektoncd/pipeline/pull/4168. I'm suggesting the same experimentation approach here. I believe this kind of experimentation, getting tangible feedback from users and dogfooding/fishfooding, is what will give us more confidence.

+1. I agree that while we have to be careful not to rearchitect too much for an experiment, it's worthwhile making it easy to try out the feature, gather feedback, and decide on the way forward.

In terms of marking the TEP implementable or not - personally I'm ok with experimenting without the implementable flag given that we haven't selected the way forward.

dibyom avatar Jun 30 '22 21:06 dibyom

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: To complete the pull request process, please ask for approval from dibyom after the PR has been reviewed.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

tekton-robot avatar Aug 01 '22 16:08 tekton-robot

@dibyom @pritidesai @vdemeester please take a look

cc @wlynch @tlawrie @imjasonh @chitrangpatel

jerop avatar Aug 02 '22 21:08 jerop

Thanks @jerop

Have we done any measurements around the overhead of (1) adding a sidecar to each taskRun that requires a result and (2) having the reconciler parse potentially a few MBs worth of results in the controller itself?

@dibyom I was supposed to do that but I completely missed it. When you say overhead, you just mean in terms of memory and CPU footprint of the pods right? If so, I can start gathering metrics right away.

chitrangpatel avatar Aug 04 '22 14:08 chitrangpatel

@dibyom I was supposed to do that but I completely missed it. When you say overhead, you just mean in terms of memory and CPU footprint of the pods right? If so, I can start gathering metrics right away.

No worries :) I think pod/run startup time would also be a good thing to measure

dibyom avatar Aug 04 '22 19:08 dibyom

@dibyom I was supposed to do that but I completely missed it. When you say overhead, you just mean in terms of memory and CPU footprint of the pods right? If so, I can start gathering metrics right away.

No worries :) I think pod/run startup time would also be a good thing to measure

@wlynch @dibyom @jerop I did the overhead tests (pipeline controller's CPU, Memory and Pod/Task startup time). For the test, I spawned 20-30 pipeline runs, each with 3 tasks (each with two steps) and gathered metrics from the metrics server during the tests. I ran it with and without sidecar logs. Here are the results:

Average pipeline controller's CPU difference during pipeline run: 1% 
Average pipeline controller's Memory usage difference during pipeline run: 0.2% 
Average Pod Startup time (time to get to running state) difference: 3 s per task (I think this is the most significant difference)

overhead with sidecar logs

The figure below shows measurements of CPU and memory usage of the control plane, taken every 10 s while the controller was running and spawning pipeline runs with sidecar logs enabled.

[figure: overhead_with_sidecar_logs]

overhead without sidecar logs

The figure below shows measurements of CPU and memory usage of the control plane, taken every 10 s while the controller was running and spawning pipeline runs with sidecar logs disabled.

[figure: overhead_without_sidecar_logs]

overhead pod startup times

The figure below shows measurements of startup times of ~50 pods, where startup time is the time to get to the Running state while monitoring on k9s. The black points and lines are measurements when spawning pods without the results sidecar, while red shows measurements when the results sidecar is spawned alongside the steps.

[figure: Pod_Startup_times]

Let me know if anything is unclear 🙂. I can share the scripts I made to do these measurements.
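For reference, the kind of 10 s polling loop such a script might run against metrics-server - a sketch only, with an assumed namespace and label selector:

```go
package overhead

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	metricsclient "k8s.io/metrics/pkg/client/clientset/versioned"
)

// pollControllerUsage samples the controller pod's CPU and memory from
// metrics-server every 10 s, mirroring the cadence described above.
func pollControllerUsage(ctx context.Context, mc metricsclient.Interface) error {
	tick := time.NewTicker(10 * time.Second)
	defer tick.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-tick.C:
			list, err := mc.MetricsV1beta1().PodMetricses("tekton-pipelines").
				List(ctx, metav1.ListOptions{LabelSelector: "app=tekton-pipelines-controller"})
			if err != nil {
				return err
			}
			for _, pm := range list.Items {
				for _, c := range pm.Containers {
					fmt.Printf("%s %s cpu=%s mem=%s\n",
						pm.Name, c.Name, c.Usage.Cpu().String(), c.Usage.Memory().String())
				}
			}
		}
	}
}
```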

chitrangpatel avatar Aug 05 '22 17:08 chitrangpatel

/assign @afrittoli

afrittoli avatar Aug 08 '22 16:08 afrittoli

@chitrangpatel Thank you so much for the performance benchmarking numbers. The extra 3 s overhead to startup time per taskrun is a bit concerning. Wondering - does the latency go up if, say, you increase the load (e.g. run 50 pipelineruns concurrently)?

From a mitigation standpoint, it might help if we only inject the sidecars when we know the task is writing a result.

dibyom avatar Aug 08 '22 16:08 dibyom

@chitrangpatel Thank you so much for the performance benchmarking numbers. The extra 3 s overhead to startup time per taskrun is a bit concerning. Wondering - does the latency go up if, say, you increase the load (e.g. run 50 pipelineruns concurrently)?

Yes, I was very concerned about the 3 s overhead too. It's a ~50% increase over the average ~6 s startup time without the results sidecar. I can check the times when I increase the load, although I will have to figure out a way to accurately measure startup times instead of eyeballing the k9s screen 50 times (like I did last time 😄).
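One way to automate that measurement - a rough client-go sketch, not the script behind the numbers above - is to watch pods and record how long each takes to reach Running:

```go
package overhead

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// watchStartupTimes prints creation-to-Running time for pods in a namespace.
// The measurement includes watch latency, so treat it as approximate.
func watchStartupTimes(ctx context.Context, c kubernetes.Interface, ns string) error {
	w, err := c.CoreV1().Pods(ns).Watch(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	defer w.Stop()

	seen := map[string]bool{}
	for ev := range w.ResultChan() {
		pod, ok := ev.Object.(*corev1.Pod)
		if !ok || seen[pod.Name] || pod.Status.Phase != corev1.PodRunning {
			continue
		}
		seen[pod.Name] = true
		fmt.Printf("%s reached Running after %v\n",
			pod.Name, time.Since(pod.CreationTimestamp.Time).Round(time.Second))
	}
	return nil
}
```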

From a mitigation standpoint, it might help if we only inject the sidecars when we know the task is writing a result.

Yes, this is already the case. The task run reconciler will check if the steps in the task are producing any results and only inject a results sidecar in that case.
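Sketched in Go, with a hypothetical sidecar name and image, that check amounts to something like:

```go
package sidecarlogs

import (
	corev1 "k8s.io/api/core/v1"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1beta1"
)

// maybeInjectResultsSidecar adds the results sidecar only when the Task
// declares results. The container spec below is illustrative, not the PoC's.
func maybeInjectResultsSidecar(pod *corev1.Pod, ts *v1beta1.TaskSpec) {
	if len(ts.Results) == 0 {
		return // no results declared, so no sidecar is needed
	}
	pod.Spec.Containers = append(pod.Spec.Containers, corev1.Container{
		Name:  "sidecar-log-results",
		Image: "example.com/results-sidecar:latest", // hypothetical image
	})
}
```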

chitrangpatel avatar Aug 08 '22 19:08 chitrangpatel

I can check the times when I increase the load, although I will have to figure out a way to accurately measure startup times instead of eyeballing the k9s screen 50 times (like I did last time 😄).

Given that this is a prototype, I think that's ok - we can use these numbers as a baseline to compare to.

dibyom avatar Aug 10 '22 15:08 dibyom

API WG - on agenda for further discussion

pritidesai avatar Aug 29 '22 16:08 pritidesai

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot avatar Nov 27 '22 17:11 tekton-robot

Opened a new TEP instead of updating TEP-0086 - https://github.com/tektoncd/community/pull/887 - so closing this PR.

jerop avatar Nov 30 '22 20:11 jerop