results icon indicating copy to clipboard operation
results copied to clipboard

switch from tkn client to k8s pod log retrieval; fix panic, add debug to e2e_gcs_test

Open gabemontero opened this issue 1 year ago • 43 comments

rh-pre-commit.version: 2.2.0 rh-pre-commit.check-secrets: ENABLED

Changes

Fixes https://github.com/tektoncd/results/issues/720

alternative to https://github.com/tektoncd/results/pull/712 with regard to still addressing the goroutine/memory leak, swtiching from the tkn client to k8s client, but with the reduction of threads, allows for retry when appropriate on UpdateLog errors.

/kind bug

These changes address a goroutine/memory leak that existed with the prior main branch vesrion,

Additionally

  • switch from tkn client to k8s pod log retrieval
  • fix panic, add debug, to e2e_gcs_test
  • rm tkn cli dependency
  • go mod tidy, vendor
  • add rbac for pod listing
  • move streamLogs onto reconciler thread
  • dump default worker pool thread count to 32 when logs are enabled
  • switch from grpc CloseSend to CloseAndRecv to address intermittent canceled context errors

Submitter Checklist

These are the criteria that every PR should meet, please check them off as you review them:

  • [n/a ] Has Docs included if any changes are user facing
  • [ n/a] Has Tests included if any functionality added or changed
  • [ /] Tested your changes locally (if this is a code change)
  • [ /] Follows the commit message standard
  • [ /] Meets the Tekton contributor standards (including functionality, content, code)
  • [/ ] Has a kind label. You can add a comment on this PR that contains /kind <type>. Valid types are bug, cleanup, design, documentation, feature, flake, misc, question, tep
  • [n/a ] Release notes block below has been updated with any user-facing changes (API changes, bug fixes, changes requiring upgrade notices or deprecation warnings)
  • [ n/a] Release notes contain the string "action required" if the change requires additional action from users switching to the new release

Release Notes

addresses UpdateLog calls from the watcher failing with canceled context, and allows for proper erro retry, while still fixing goroutine leak, which leads to overall memory leak, when storing logs

gabemontero avatar Feb 20 '24 21:02 gabemontero

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 55.7% -13.7

tekton-robot avatar Feb 20 '24 21:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 55.7% -13.7

tekton-robot avatar Feb 20 '24 21:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 55.4% -14.0

tekton-robot avatar Feb 20 '24 22:02 tekton-robot

@sayan-biswas @avinal @enarha @ramessesii2 @adambkaplan FYI a first pass at replacing tkn cli with generic k8s get pod logs API calls

I could see this replacing #712 after a little more refining.... but curious on opinions / feedback

also really close to getting rid of launching the separate goroutine in sendLogs, but could also see deferring that to a follow pull request that also dives into verifying UpdateLogs completed in the API server to completely remove the intermittent canceled context UpdateLog instances we are seeing in Pavel's soak test. But I took a first pass at that today and that is going to require some unit test overall, as I've discovered the unit tests are not fully wired to return fake data.

gabemontero avatar Feb 20 '24 22:02 gabemontero

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 55.7% -13.7

tekton-robot avatar Feb 21 '24 17:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 61.7% -7.7

tekton-robot avatar Feb 21 '24 21:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 61.5% -7.9

tekton-robot avatar Feb 21 '24 21:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 61.5% -7.9

tekton-robot avatar Feb 22 '24 17:02 tekton-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sayan-biswas

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

tekton-robot avatar Feb 23 '24 11:02 tekton-robot

Looks good to me. I ran it locally and the logs were stored perfectly. Although I did't check with bulk pipeline runs and larger logs.

I just have one comment on the log format.

*** Logs for pod test-pipelinerun-dd25b-task1-pod container step-print-uid ***
TaskRun UID: fd819238-26b8-46c7-9e2c-f31a41fe7bff
PipelineRun UID from params: 79e2c2d5-748b-4790-8ef1-eab31c833f10
*** Logs for pod test-pipelinerun-dd25b-task1-pod container step-print-names ***
Task name: test-pipelinerun-dd25b-task1
TaskRun name: test-pipelinerun-dd25b-task1
Pipeline name from params: test-pipelinerun-dd25b
PipelineRun name from params: test-pipelinerun-dd25b

sayan-biswas avatar Feb 23 '24 11:02 sayan-biswas

Looks good to me. I ran it locally and the logs were stored perfectly. Although I did't check with bulk pipeline runs and larger logs.

I just have one comment on the log format.

*** Logs for pod test-pipelinerun-dd25b-task1-pod container step-print-uid ***
TaskRun UID: fd819238-26b8-46c7-9e2c-f31a41fe7bff
PipelineRun UID from params: 79e2c2d5-748b-4790-8ef1-eab31c833f10
*** Logs for pod test-pipelinerun-dd25b-task1-pod container step-print-names ***
Task name: test-pipelinerun-dd25b-task1
TaskRun name: test-pipelinerun-dd25b-task1
Pipeline name from params: test-pipelinerun-dd25b
PipelineRun name from params: test-pipelinerun-dd25b

@sayan-biswas is your question here ^^ separate from the one in https://github.com/tektoncd/results/pull/715#discussion_r1500558338 ?

If it is, apologies I'm missing what you are getting at, and ask that you elaborate either here or in slack/team meetings. Thanks.

gabemontero avatar Feb 25 '24 15:02 gabemontero

@gabemontero No, the one the comment is what I was referring. Everything is working for me. Did not test memory footprint though.

sayan-biswas avatar Feb 25 '24 18:02 sayan-biswas

@gabemontero No, the one the comment is what I was referring. Everything is working for me. Did not test memory footprint though.

OK thanks for confirming the comments refer to the same core notion @sayan-biswas

As to memory footprint, I am waiting to hear back from Pavel on when he can start a smoke test, so not yet.

This is a factor as to why I want to merge #712 first. We have validated the memory footprint is fixed there. Also, I want to see how this change looks in our prod env for a bit with more worker threads, particularly with the workqueue depth, so as to potentially compare with this change.

gabemontero avatar Feb 26 '24 17:02 gabemontero

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 56.1% -13.2

tekton-robot avatar Feb 26 '24 21:02 tekton-robot

/hold

merge after #712, but let's review get it ready to merge in parallel

gabemontero avatar Feb 26 '24 21:02 gabemontero

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 56.7% -12.6

tekton-robot avatar Feb 26 '24 21:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 55.0% -14.3

tekton-robot avatar Feb 26 '24 21:02 tekton-robot

/retest

gabemontero avatar Feb 26 '24 22:02 gabemontero

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 54.7% -14.6

tekton-robot avatar Feb 26 '24 22:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 53.7% -15.7

tekton-robot avatar Feb 26 '24 23:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 53.7% -15.7

tekton-robot avatar Feb 26 '24 23:02 tekton-robot

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 69.3% 53.7% -15.7

tekton-robot avatar Feb 26 '24 23:02 tekton-robot

/retest

gabemontero avatar Feb 26 '24 23:02 gabemontero

Tests green at the moment with the tkn formatting, but there does seem to be some sort of intermittent flake in the vanilla e2e-tests I have not seen before today and the change to do tkn formatting.

 === RUN   TestPipelineRun/check_annotations
    e2e_test.go:238: Error waiting for PipelineRun creation: timed out waiting for the condition
=== RUN   TestPipelineRun/check_deletion
    e2e_test.go:254: Error waiting for PipelineRun deletion: timed out waiting for the condition
=== RUN   TestPipelineRun/check_result
    e2e_test.go:260: Result name not found
=== RUN   TestPipelineRun/check_record
    e2e_test.go:270: Record name not found
=== RUN   TestPipelineRun/result_data_consistency
    e2e_test.go:283: rpc error: code = InvalidArgument desc = rpc error: code = InvalidArgument desc = name must match (^[a-z0-9_-]{1,63})/results/([a-z0-9_-]{1,63}$)
--- FAIL: TestPipelineRun (120.07s)
    --- FAIL: TestPipelineRun/check_annotations (60.02s)
    --- FAIL: TestPipelineRun/check_deletion (60.02s)
    --- SKIP: TestPipelineRun/check_result (0.00s)
    --- SKIP: TestPipelineRun/check_record (0.00s)
    --- FAIL: TestPipelineRun/result_data_consistency (0.00s) 

let's review that change all the same @sayan-biswas , but I'll add some debug to that base e2e, see if I can uncover anything.

gabemontero avatar Feb 27 '24 02:02 gabemontero

@gabemontero This is because the polling time set is too low. So before the TaskRun/PipelineRun is finishing, it is getting timeout and hence the test is failing. Sometimes the cluster is slow, so this happens intermittently.

https://github.com/tektoncd/results/blob/main/test%2Fe2e%2Fe2e_test.go#L145

sayan-biswas avatar Feb 27 '24 06:02 sayan-biswas

@gabemontero This is because the polling time set is too low. So before the TaskRun/PipelineRun is finishing, it is getting timeout and hence the test is failing. Sometimes the cluster is slow, so this happens intermittently.

https://github.com/tektoncd/results/blob/main/test%2Fe2e%2Fe2e_test.go#L145

gotcha thanks for the history/insight and detail

if you want me to bump those various PollImmediate calls in that test as part of this, let me know

gabemontero avatar Feb 27 '24 10:02 gabemontero

The following is the coverage report on the affected files. Say /test pull-tekton-results-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/watcher/reconciler/dynamic/dynamic.go 64.2% 55.8% -8.4

tekton-robot avatar Feb 27 '24 16:02 tekton-robot

rebase basically OK, but

I think there was a hiccup on the go mod download which impacted golint with the build-tests

re-running go mod tidy and go mod vendor produced no new changes in my branch

let's see what happens with

/retest

gabemontero avatar Feb 27 '24 18:02 gabemontero

rebase basically OK, but

I think there was a hiccup on the go mod download which impacted golint with the build-tests

re-running go mod tidy and go mod vendor produced no new changes in my branch

let's see what happens with

/retest

flake confirmed

e2e's green

@sayan-biswas @avinal per our video conf today, feel free to provide feedback for changes here, or lgtm, as your availability allows. In the mean time, we'll start testing #712 on our end, and gather some base lines with increased worker thread pool, as well as revalidate the memory leak fix we confirmed in our performance stress test. Along with hopefully getting a performance stresss test with these changes soon, in order to helpe validate the merging of this path we all agree is better long term.

I'll remove the hold once all ^^ is sorted out.

thanks

gabemontero avatar Feb 27 '24 19:02 gabemontero

So @pmacik's initial soak testing appears to uncover a new version of the goroutine leak (grpc client goroutines).

I suspect it is not too dissimilar what was was fixed with #712, but with the removal of the tkn client thread/channel/context machinations, we've fallen back into the warnings around https://pkg.go.dev/google.golang.org/grpc#ClientConn.NewStream

@pmacik's pprof data showed a lot of parked threads in one of the new connection goroutines waiting on context's be done.

I've started working on an update to this PR that both:

  • creates a child context that we can cancel when appropriate to presumably unblock the grpc cleanup
  • also addresses the intermittent context canceled issues on the UpdateLog on the api server side by verifying the UpdateLog is complete before canceling the new child context

However, I appear to be hitting flakiness with the grpc unit test alternatives for file storage, even when stepping through the code. I can see repeated GetLogs fail when the data is clearly on my local disk.

When I next pick this up, I'll see about ignoring these errors in the unit tests, and see how the PR e2e's using kind do with everything.

@avinal @sayan-biswas @ramessesii2 FYI

gabemontero avatar Feb 29 '24 13:02 gabemontero