switch from tkn client to k8s pod log retrieval; fix panic, add debug to e2e_gcs_test
Changes
Fixes https://github.com/tektoncd/results/issues/720
An alternative to https://github.com/tektoncd/results/pull/712 that still addresses the goroutine/memory leak by switching from the tkn client to the k8s client, but with the reduction of threads, and allows for retry when appropriate on UpdateLog errors.
/kind bug
These changes address a goroutine/memory leak that existed with the prior main branch version.
Additionally:
- switch from tkn client to k8s pod log retrieval (see the sketch after this list)
- fix panic and add debug to e2e_gcs_test
- rm tkn cli dependency
- go mod tidy, vendor
- add rbac for pod listing
- move streamLogs onto reconciler thread
- bump default worker pool thread count to 32 when logs are enabled
- switch from grpc CloseSend to CloseAndRecv to address intermittent canceled context errors
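For reference, a minimal sketch of what the k8s-client-based log retrieval can look like, using client-go's pod logs subresource; the function name, parameters, and destination writer are illustrative, not the exact code in this PR:

```go
package logs

import (
	"context"
	"io"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// streamPodLogs copies one container's logs to dst using the core v1 pod
// logs subresource instead of the tkn client. Names and the destination
// writer are illustrative only.
func streamPodLogs(ctx context.Context, kc kubernetes.Interface, ns, pod, container string, dst io.Writer) error {
	req := kc.CoreV1().Pods(ns).GetLogs(pod, &corev1.PodLogOptions{Container: container})
	rc, err := req.Stream(ctx)
	if err != nil {
		return err
	}
	defer rc.Close()
	// dst could, for example, wrap the UpdateLog gRPC stream on the watcher side.
	_, err = io.Copy(dst, rc)
	return err
}
```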
Submitter Checklist
These are the criteria that every PR should meet; please check them off as you review them:
- [n/a] Has Docs included if any changes are user facing
- [n/a] Has Tests included if any functionality added or changed
- [x] Tested your changes locally (if this is a code change)
- [x] Follows the commit message standard
- [x] Meets the Tekton contributor standards (including functionality, content, code)
- [x] Has a kind label. You can add a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
- [n/a] Release notes block below has been updated with any user-facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
- [n/a] Release notes contain the string "action required" if the change requires additional action from users switching to the new release
Release Notes
Addresses UpdateLog calls from the watcher failing with a canceled context and allows for proper error retry, while still fixing the goroutine leak (which leads to an overall memory leak) when storing logs.
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 55.7% | -13.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 55.4% | -14.0 |
@sayan-biswas @avinal @enarha @ramessesii2 @adambkaplan FYI a first pass at replacing tkn cli with generic k8s get pod logs API calls
I could see this replacing #712 after a little more refining... but curious on opinions/feedback
also really close to getting rid of launching the separate goroutine in sendLogs, but could also see deferring that to a follow-up pull request that also dives into verifying UpdateLogs completed in the API server, to completely remove the intermittent canceled context UpdateLog instances we are seeing in Pavel's soak test. But I took a first pass at that today and it is going to require some unit test overhaul, as I've discovered the unit tests are not fully wired to return fake data.
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 55.7% | -13.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 61.7% | -7.7 |
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 61.5% | -7.9 |
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: sayan-biswas
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [sayan-biswas]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
Looks good to me. I ran it locally and the logs were stored perfectly, though I didn't check with bulk pipeline runs and larger logs.
I just have one comment on the log format.
*** Logs for pod test-pipelinerun-dd25b-task1-pod container step-print-uid ***
TaskRun UID: fd819238-26b8-46c7-9e2c-f31a41fe7bff
PipelineRun UID from params: 79e2c2d5-748b-4790-8ef1-eab31c833f10
*** Logs for pod test-pipelinerun-dd25b-task1-pod container step-print-names ***
Task name: test-pipelinerun-dd25b-task1
TaskRun name: test-pipelinerun-dd25b-task1
Pipeline name from params: test-pipelinerun-dd25b
PipelineRun name from params: test-pipelinerun-dd25b
@sayan-biswas is your question here ^^ separate from the one in https://github.com/tektoncd/results/pull/715#discussion_r1500558338 ?
If it is, apologies, I'm missing what you are getting at; please elaborate either here or in Slack/team meetings. Thanks.
@gabemontero No, that comment is the one I was referring to. Everything is working for me. Did not test memory footprint though.
OK thanks for confirming the comments refer to the same core notion @sayan-biswas
As to memory footprint, I am waiting to hear back from Pavel on when he can start a smoke test, so not yet.
This is a factor in why I want to merge #712 first. We have validated the memory footprint is fixed there. Also, I want to see how that change looks in our prod env for a bit with more worker threads, particularly with the workqueue depth, so as to potentially compare it with this change.
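For reference, the worker pool bump from the change list amounts to something like the sketch below, assuming knative's controller package; the package name and the logsEnabled flag are stand-ins for the watcher's actual config wiring:

```go
package watcher // illustrative package name

import "knative.dev/pkg/controller"

// configureWorkers raises knative's per-controller worker count when log
// storage is enabled, so streaming logs on the reconciler thread does not
// starve other keys in the workqueue. logsEnabled stands in for whatever
// config field actually gates log storage.
func configureWorkers(logsEnabled bool) {
	if logsEnabled {
		controller.DefaultThreadsPerController = 32
	}
}
```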
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 56.1% | -13.2 |
/hold
merge after #712, but let's review and get it ready to merge in parallel
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 56.7% | -12.6 |
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 55.0% | -14.3 |
/retest
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 54.7% | -14.6 |
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 69.3% | 53.7% | -15.7 |
/retest
Tests are green at the moment with the tkn formatting, but there does seem to be some sort of intermittent flake in the vanilla e2e tests that I had not seen before today's change to do tkn formatting.
=== RUN TestPipelineRun/check_annotations
e2e_test.go:238: Error waiting for PipelineRun creation: timed out waiting for the condition
=== RUN TestPipelineRun/check_deletion
e2e_test.go:254: Error waiting for PipelineRun deletion: timed out waiting for the condition
=== RUN TestPipelineRun/check_result
e2e_test.go:260: Result name not found
=== RUN TestPipelineRun/check_record
e2e_test.go:270: Record name not found
=== RUN TestPipelineRun/result_data_consistency
e2e_test.go:283: rpc error: code = InvalidArgument desc = rpc error: code = InvalidArgument desc = name must match (^[a-z0-9_-]{1,63})/results/([a-z0-9_-]{1,63}$)
--- FAIL: TestPipelineRun (120.07s)
--- FAIL: TestPipelineRun/check_annotations (60.02s)
--- FAIL: TestPipelineRun/check_deletion (60.02s)
--- SKIP: TestPipelineRun/check_result (0.00s)
--- SKIP: TestPipelineRun/check_record (0.00s)
--- FAIL: TestPipelineRun/result_data_consistency (0.00s)
let's review that change all the same @sayan-biswas, but I'll add some debug to that base e2e and see if I can uncover anything.
@gabemontero This is because the polling time set is too low. So before the TaskRun/PipelineRun finishes, it times out and the test fails. Sometimes the cluster is slow, so this happens intermittently.
https://github.com/tektoncd/results/blob/main/test%2Fe2e%2Fe2e_test.go#L145
gotcha thanks for the history/insight and detail
if you want me to bump those various PollImmediate calls in that test as part of this, let me know
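For illustration, bumping those polls would look roughly like the sketch below; the 2s/2m values and the helper name are placeholders, not values agreed on in this thread:

```go
package e2e // illustrative

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForPipelineRun polls until checkDone reports completion. A longer
// timeout gives a slow cluster time to finish the PipelineRun before the
// test gives up; the interval/timeout here are placeholders.
func waitForPipelineRun(checkDone wait.ConditionFunc) error {
	return wait.PollImmediate(2*time.Second, 2*time.Minute, checkDone)
}
```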
The following is the coverage report on the affected files.
Say /test pull-tekton-results-go-coverage to re-run this coverage report
File | Old Coverage | New Coverage | Delta |
---|---|---|---|
pkg/watcher/reconciler/dynamic/dynamic.go | 64.2% | 55.8% | -8.4 |
rebase basically OK, but I think there was a hiccup on the go mod download which impacted golint with the build-tests
re-running go mod tidy and go mod vendor produced no new changes in my branch
let's see what happens with
/retest
flake confirmed
e2e's green
@sayan-biswas @avinal per our video conf today, feel free to provide feedback for changes here, or lgtm, as your availability allows. In the meantime, we'll start testing #712 on our end, gather some baselines with the increased worker thread pool, and revalidate the memory leak fix we confirmed in our performance stress test, along with hopefully getting a performance stress test with these changes soon, to help validate merging this path, which we all agree is better long term.
I'll remove the hold once all ^^ is sorted out.
thanks
So @pmacik's initial soak testing appears to uncover a new version of the goroutine leak (grpc client goroutines).
I suspect it is not too dissimilar to what was fixed with #712, but with the removal of the tkn client thread/channel/context machinations, we've fallen back into the warnings around https://pkg.go.dev/google.golang.org/grpc#ClientConn.NewStream
@pmacik's pprof data showed a lot of parked threads in one of the new connection goroutines waiting on contexts to be done.
I've started working on an update to this PR that both:
- creates a child context that we can cancel when appropriate to presumably unblock the grpc cleanup
- also addresses the intermittent context canceled issues on the UpdateLog on the api server side by verifying the UpdateLog is complete before canceling the new child context
However, I appear to be hitting flakiness with the grpc unit test alternatives for file storage, even when stepping through the code. I can see repeated GetLogs calls fail when the data is clearly on my local disk.
When I next pick this up, I'll see about ignoring these errors in the unit tests, and see how the PR e2e's using kind do with everything.
@avinal @sayan-biswas @ramessesii2 FYI
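For context, a rough sketch of the child-context plus CloseAndRecv idea described above; the logsUpdater/logStream interfaces are stand-ins for the generated UpdateLog client, not the actual Results API surface:

```go
package watcher // illustrative

import "context"

// logStream and logsUpdater are stand-ins for the generated UpdateLog client
// types; the real Results API differs, this only illustrates the shape.
type logStream interface {
	Send(chunk []byte) error
	CloseAndRecv() error
}

type logsUpdater interface {
	UpdateLog(ctx context.Context) (logStream, error)
}

// uploadLogs derives a cancellable child context for the UpdateLog stream and
// only cancels it after CloseAndRecv confirms the server finished, so parked
// gRPC goroutines can exit without the API server seeing a canceled context
// mid-UpdateLog.
func uploadLogs(parent context.Context, client logsUpdater, chunks [][]byte) error {
	ctx, cancel := context.WithCancel(parent)
	defer cancel() // unblocks any gRPC internals still waiting on ctx.Done()

	stream, err := client.UpdateLog(ctx)
	if err != nil {
		return err
	}
	for _, c := range chunks {
		if err := stream.Send(c); err != nil {
			return err
		}
	}
	// CloseSend would return immediately; CloseAndRecv blocks for the server's
	// final response, so UpdateLog errors surface and can be retried.
	return stream.CloseAndRecv()
}
```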