
Individual log is not retrievable via the name returned in the log list

Open iainsproat opened this issue 1 year ago • 9 comments

Expected Behavior

curl -k https://speckle.internal:8088/apis/results.tekton.dev/v1alpha2/parents/bfeb4b5d45/results/47373980-cbcb-4b57-8746-40cc5fd75649/logs

This returns the following, as expected:

{"records":[{"name":"bfeb4b5d45/results/47373980-cbcb-4b57-8746-40cc5fd75649/logs/77a71268-1422-39f4-8b37-8e5264c6f2ae", "id":"58f21208-d056-4506-a1c8-124b3e34c830", "uid":"58f21208-d056-4506-a1c8-124b3e34c830", "data":{"type":"results.tekton.dev/v1alpha2.Log", "value":"base64encodedlogs"}, "etag":"58f21208-d056-4506-a1c8-124b3e34c830-1687799787144731088", "createdTime":"2023-06-26T17:16:27.144731Z", "createTime":"2023-06-26T17:16:27.144731Z", "updatedTime":"2023-06-26T17:16:27.144731Z", "updateTime":"2023-06-26T17:16:27.144731Z"}], "nextPageToken":""}`

I expect that requesting the .records[0].name would return the individual log record that can be seen in the list, e.g.:

curl -k https://speckle.internal:8088/apis/results.tekton.dev/v1alpha2/parents/bfeb4b5d45/results/47373980-cbcb-4b57-8746-40cc5fd75649/logs/77a71268-1422-39f4-8b37-8e5264c6f2ae

I may be doing something incorrect here, but it seems to match the API documentation.

What I expect to have returned is the following:

{"name":"bfeb4b5d45/results/47373980-cbcb-4b57-8746-40cc5fd75649/logs/77a71268-1422-39f4-8b37-8e5264c6f2ae", "id":"58f21208-d056-4506-a1c8-124b3e34c830", "uid":"58f21208-d056-4506-a1c8-124b3e34c830", "data":{"type":"results.tekton.dev/v1alpha2.Log", "value":"base64encodedlogs"}

Actual Behavior

Retrieving the .records[0].name and attempting to access the resource results in the log record not being found.

curl -k https://speckle.internal:8088/apis/results.tekton.dev/v1alpha2/parents/bfeb4b5d45/results/47373980-cbcb-4b57-8746-40cc5fd75649/logs/77a71268-1422-39f4-8b37-8e5264c6f2ae
{"error":{"code":5,"message":"Log doesn't exist","details":[]}}

Steps to Reproduce the Problem

Steps detailed above

Additional Info

  • Kubernetes version:

    Output of kubectl version:

    Client Version: v1.27.3
    Kustomize Version: v5.0.1
    Server Version: v1.27.3
    
  • Tekton Pipeline version:

    Client version: 0.31.1
    Pipeline version: v0.48.0
    

iainsproat avatar Jun 26 '23 17:06 iainsproat

The API pod logs if this helps:

{"level":"error","ts":1687800526.2499058,"caller":"v1alpha2/logs.go:61","msg":"no logs exist for bfeb4b5d45/results/47373980-cbcb-4b57-8746-40cc5fd75649/logs/77a71268-1422-39f4-8b37-8e5264c6f2ae","stacktrace":"github.com/tektoncd/results/pkg/api/server/v1alpha2.(*Server).GetL
og\n\tgithub.com/tektoncd/results/pkg/api/server/v1alpha2/logs.go:61\ngithub.com/tektoncd/results/proto/v1alpha2/results_go_proto._Logs_GetLog_Handler\n\tgithub.com/tektoncd/results/proto/v1alpha2/results_go_proto/api_grpc.pb.go:579\ngithub.com/grpc-ecosystem/go-grpc-promethe
us.(*ServerMetrics).StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/[email protected]/server_metrics.go:121\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.com/grpc-ecosystem/[email protected]/chain.go:49\ngithub.
com/grpc-ecosystem/go-grpc-middleware/auth.StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/[email protected]/auth/auth.go:66\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.com/grpc-ecosystem/[email protected]/cha
in.go:49\ngithub.com/grpc-ecosystem/go-grpc-middleware/logging/zap.StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/[email protected]/logging/zap/server_interceptors.go:53\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1.1.1\n\tgithub.co
m/grpc-ecosystem/[email protected]/chain.go:49\ngithub.com/grpc-ecosystem/go-grpc-middleware/tags.StreamServerInterceptor.func1\n\tgithub.com/grpc-ecosystem/[email protected]/tags/interceptors.go:39\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServ
er.func1.1.1\n\tgithub.com/grpc-ecosystem/[email protected]/chain.go:49\ngithub.com/grpc-ecosystem/go-grpc-middleware.ChainStreamServer.func1\n\tgithub.com/grpc-ecosystem/[email protected]/chain.go:58\ngoogle.golang.org/grpc.(*Server).processStreamingRPC\n\tgo
ogle.golang.org/[email protected]/server.go:1627\ngoogle.golang.org/grpc.(*Server).handleStream\n\tgoogle.golang.org/[email protected]/server.go:1708\ngoogle.golang.org/grpc.(*Server).serveStreams.func1.2\n\tgoogle.golang.org/[email protected]/server.go:965"}

iainsproat avatar Jun 26 '23 17:06 iainsproat

@iainsproat Which log type did you set in the config, File or S3? Can you share the log message from the UpdateLog method?

sayan-biswas avatar Jun 27 '23 07:06 sayan-biswas

Also, logs are not enabled by default. Although the API can be enabled from the config, separate configuration is required: for the File type a PVC needs to be deployed, and for the S3 type the remaining S3-related parameters in the config need to be set.
There is a recently merged PR which makes this easier through kustomize: https://github.com/tektoncd/results/blob/main/config/README.md#deploying-with-a-kustomization
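
For anyone checking their own install, here is a minimal sketch of how to inspect that configuration. The namespace, ConfigMap name and key names are assumptions based on the linked README, so verify them against your deployment.

# Assumed names: namespace tekton-pipelines, ConfigMap tekton-results-api-config.
kubectl -n tekton-pipelines get configmap tekton-results-api-config -o yaml
# For logs to be retrievable, the config needs entries along these lines:
#   LOGS_API: "true"
#   LOGS_TYPE: "S3"    # or "File", which additionally requires a PVC to be mounted
#   plus the S3 parameters (bucket, endpoint, credentials, region) when using S3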

sayan-biswas avatar Jun 27 '23 07:06 sayan-biswas

Thanks for responding @sayan-biswas

Your suggestion about looking at the logs for the UpdateLog method pointed me in the right direction. There appears to be a race condition in the Tekton Results Watcher between its deletion of a completed TaskRun and its uploading of that TaskRun's logs.

I am running the Tekton Results Watcher with the following flag: -completed_run_grace_period -1s. The negative value causes completed TaskRuns and PipelineRuns to be 'garbage collected' immediately, as suggested by the documentation.

In the Tekton Watcher logs I can see the garbage collection occurring successfully:

{"level":"info","time":"2023-06-27T10:21:24.301Z","logger":"watcher","caller":"dynamic/dynamic.go:216","msg":"Deleting object","commit":"26df57c-dirty","knative.dev/traceid":"e55448ea-474f-449f-8f5e-6b1800c70bb5","knative.dev/key":"789b32327d/hello-task-run","results.tekton.dev/kind":"TaskRun","results.tekton.dev/result":"789b32327d/results/24f8d7c5-0caa-4999-bd4c-52ffedcf6f43","results.tekton.dev/record":"789b32327d/results/24f8d7c5-0caa-4999-bd4c-52ffedcf6f43/records/24f8d7c5-0caa-4999-bd4c-52ffedcf6f43","results.tekton.dev/gracePeriod":"-1s","results.tekton.dev/uid":"24f8d7c5-0caa-4999-bd4c-52ffedcf6f43","results.tekton.dev/time-taken-seconds":0}

However, some time later the Tekton Watcher logs the following failure:

{"level":"error","time":"2023-06-27T10:21:24.320Z","logger":"watcher","caller":"dynamic/dynamic.go:319","msg":"Error streaming log","commit":"26df57c-dirty","knative.dev/traceid":"e55448ea-474f-449f-8f5e-6b1800c70bb5","knative.dev/key":"789b32327d/hello-task-run","results.tekton.dev/kind":"TaskRun","namespace":"789b32327d","kind":"TaskRun","name":"hello-task-run","error":"error reading from tkn reader: Unable to get TaskRun: taskruns.tekton.dev \"hello-task-run\" not found","stacktrace":"github.com/tektoncd/results/pkg/watcher/reconciler/dynamic.(*Reconciler).sendLog.func1\n\tgithub.com/tektoncd/results/pkg/watcher/reconciler/dynamic/dynamic.go:319"}

As the issue is intermittent rather than consistent, I am confident that S3 is configured correctly. Querying the blob storage provider directly, I can see that the data for other Tekton TaskRuns and PipelineRuns (which presumably won the race condition!) has been stored, and it can be queried via the API.

My expectation is that the garbage collection of completed TaskRuns/PipelineRuns would wait until all the logs are confirmed as having been successfully stored before proceeding to delete the completed TaskRun or PipelineRun.

In the meantime, as a workaround, I'll increase the value of the -completed_run_grace_period flag to a larger positive duration to allow sufficient time for the logging to complete prior to garbage collection.
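
A rough sketch of that workaround is below; the Deployment name, namespace and chosen duration are assumptions, so adjust them for your install.

# Check the current flag value on the watcher, then edit it to a positive duration.
kubectl -n tekton-pipelines get deployment tekton-results-watcher -o yaml | grep completed_run_grace_period
kubectl -n tekton-pipelines edit deployment tekton-results-watcher
# e.g. change the container arg to: -completed_run_grace_period 10m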

I've attached the entire log output for both the API and the Watcher. During this logged period I started Tekton Results, applied a Task, and then applied a TaskRun. tekton-watcher-logs.txt tekton-api-logs.txt

iainsproat avatar Jun 27 '23 10:06 iainsproat

@iainsproat Yes, this happens because the log streaming from the watcher to the API server only starts after the PipelineRun/TaskRun moves to a completed or failed status. If the pod gets deleted by aggressive pruning, such as immediate pruning, it creates a race condition. I'm not sure what can be done about fetching the logs, but the pruning conditions can be modified to wait for the streaming to complete even if the grace period is set to -1s.

sayan-biswas avatar Jun 27 '23 14:06 sayan-biswas

/assign

sayan-biswas avatar Jun 27 '23 14:06 sayan-biswas

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale with a justification. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle stale

Send feedback to tektoncd/plumbing.

tekton-robot avatar Sep 25 '23 15:09 tekton-robot

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten with a justification. Rotten issues close after an additional 30d of inactivity. If this issue is safe to close now please do so with /close with a justification. If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

tekton-robot avatar Oct 25 '23 15:10 tekton-robot

/assign

ramessesii2 avatar Feb 05 '24 12:02 ramessesii2