woodpecker icon indicating copy to clipboard operation
woodpecker copied to clipboard

k8s agent fails to tail step logs

Open praveen-livspace opened this issue 1 year ago • 16 comments

Component

agent

Describe the bug

For some of our repos, trivy scan step, triggers the agent to log this error. I noticed this not just with trivy but also with megalinter.

{ "level": "error", "error": "rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8", "time": "2023-08-20T07:18:46Z", "message": "grpc error: log(): code: Internal: rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8" }

Trivy command executed as part of the step.

trivy fs \ --exit-code 1 \ --skip-dirs vendor,node_modules \ --report summary \ --severity CRITICAL \ --scanners vuln \ --ignore-unfixed \ --quiet \ .

System Info

{"source":"https://github.com/woodpecker-ci/woodpecker","version":"next-0c282e86e8"}

Additional context

No response

Validations

  • [X] Read the Contributing Guidelines.
  • [X] Read the docs.
  • [X] Check that there isn't already an issue that reports the same bug to avoid creating a duplicate.
  • [X] Checked that the bug isn't fixed in the next version already [https://woodpecker-ci.org/faq#which-version-of-woodpecker-should-i-use]
  • [x] Check that this is a concrete bug. For Q&A join our Discord Chat Server or the Matrix room.

praveen-livspace avatar Aug 20 '23 15:08 praveen-livspace

Is this still an issue?

qwerty287 avatar Feb 11 '24 12:02 qwerty287

yes.... have the same problem (woodpecker 2.3.0).... making woodpecker-agent spin indefinitely using a lot of cpu... have to kill the agent frequently to make it work again.... it's annoying....

metawave avatar Feb 16 '24 16:02 metawave

Can you get debug logs?

qwerty287 avatar Feb 17 '24 08:02 qwerty287

I have the same issue (2.3.0 with k8s agent). This example workflow triggers this issue:

steps:
  import-gpg:
    image: debian:stable-slim
    commands:
      - apt-get -qq update 
      - apt-get -qq install --yes --no-install-recommends ca-certificates wget gpg gpg-agent dirmngr
      - wget --quiet https://downloads.apache.org/maven/KEYS
      - gpg --import KEYS

While importing the KEYS woodpecker agent logs:

{"level":"error","error":"rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8","time":"2024-02-21T10:02:11Z","message":"grpc error: log(): code: Internal"}

sschnabe avatar Feb 21 '24 10:02 sschnabe

I wrote that above already: please provide debug logs if possible. They contain extra fields like the file and line where this is logged. Otherwise it's hard to help here (at least for me, maybe the people knowing more about k8s can tell you the issue).

qwerty287 avatar Feb 21 '24 11:02 qwerty287

With WOODPECKER_DEBUG_PRETTY=true and WOODPECKER_LOG_LEVEL=trace i got:

1:07PM TRC src/pipeline/rpc/log_entry.go:82 > grpc write line: gpg: key 4F9BDB78DB596386: public key "Vincent Siveton <[email protected]>" imported
gpg: key B95BBD3FA43C4492: 22 signatures not checked due to missing keys
gpg: key B95BBD3FA43C4492: public key "Carlos Sanchez <[email protected]>" imported
gpg: key E181870DC625BAFB: public key "Maria Odea Ching <[email protected]>" imported
gpg: key D2A765937BA507E8: public key "Mauro Talevi <[email protected]>" imported
gpg: key C92C5FEC70161C62: public key "Herve Boutemy <[email protected]>" imported
gpg: key 873A8E86B4372146: public key "Olivier Lamy <[email protected]>" imported
gpg: key 8F0FABD30F353251: public key "Dan Fabulich <[email protected]>" imported
gpg: key 0B7B8BED064C851C: public key "Rapha�l Pi�roni <[email protected]>" imported
gpg: key C2D879C8B0874707: public key "Milos Kleint <[email protected]>" imported
gpg: key 8FE7CB51365A46EF: public key "Mark Hobson <[email protected]>" imported
gpg: key A3F9CCC081C4177D: public key "Oleg Gusakov <[email protected]>" imported
gpg: key 30E6F80434A72A7F: public key "John Dennis Casey <[email protected]>" imported
gpg: key E8579C183571506F: public key "Rapha�l Pi�roni <[email protected]>" imported
gpg: key A8AD2A7E9A25CE21: public key "Barrie Treloar <[email protected]>" imported
gpg: key 3B58205B9D7013A9: public key "Dan T. Tran <[email protected]>" imported
 step-uuid=01HQ5WD8SBBCTMPE81QTQJS272
1:07PM ERR src/agent/rpc/client_grpc.go:335 > grpc error: log(): code: Internal error="rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8"

sschnabe avatar Feb 21 '24 13:02 sschnabe

This was a problem for me as well so I've attempted a fix here https://github.com/woodpecker-ci/woodpecker/pull/3601

fernandrone avatar Apr 08 '24 19:04 fernandrone

https://github.com/woodpecker-ci/woodpecker/pull/3601#issuecomment-2056975576

I haven't noticed any missing logs on our builds. However I think the true test would be to run a build that outputs non UTF-8 characters and see how Woodpecker behaves. Given the issue, I'd expect Woodpecker to not print them

Indeed, WP doesn't print them.

I've reproduced this with pipeline:

skip_clone: true
steps:
  logs:
    image: alpine
    commands:
      - apk add openssl
      - openssl rand 1000
      - echo Done

WP output:

+ apk add openssl
fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/main/x86_64/APKINDEX.tar.gz
fetch https://dl-cdn.alpinelinux.org/alpine/v3.19/community/x86_64/APKINDEX.tar.gz
(1/3) Upgrading libcrypto3 (3.1.4-r5 -> 3.1.4-r6)
(2/3) Upgrading libssl3 (3.1.4-r5 -> 3.1.4-r6)
(3/3) Installing openssl (3.1.4-r6)
Executing busybox-1.36.1-r15.trigger
OK: 8 MiB in 16 packages
+ openssl rand 1000

wp-01hwa0ya0qqn1exr68j5ftj5tq.log

Pod was in succeeded state before pipeline cancellation.

But I didn't notice high CPU usage (at least after I cancelled pipeline):

> kubectl top pods -n test-scm-cicd
NAME                                 CPU(cores)   MEMORY(bytes)
gitea-0                              3m           125Mi
woodpecker-agent-d9b864c76-lqdgj     1m           10Mi
woodpecker-server-75c9794774-nglv7   3m           29Mi

Tested on Woodpecker 2.3.0 in Kubernetes. woodpecker-server.log woodpecker-agent.log

As you can see, Agent logs error even after pipeline was cancelled.


Now my pipeline is in status killed, but the step stays in the running.

zc-devs avatar Apr 25 '24 07:04 zc-devs

I had the feeling we use base64 encoding for the log data, but I can't find it in the agent code anymore 🤔. Shouldn't base64-encoding the log data solve this issue?

anbraten avatar Apr 25 '24 08:04 anbraten

I have a feeling that the encoding happens on the server at some point.

golyalpha avatar Apr 25 '24 09:04 golyalpha

I have a feeling that the encoding happens on the server at some point.

Same for me, but I have no idea where we do this 🙈 Searched for encoding/base64, but can't find where we do this for the log entries at all. Will try to find it using the debugger when I've some time. Any hints are more than welcome 😉

anbraten avatar Apr 25 '24 10:04 anbraten

we use base64 encoding for the log data

Nope:

https://github.com/woodpecker-ci/woodpecker/blob/d0057736f164a8a12e671ff3c39ac2c33b80a7f6/server/model/log.go#L33

postgres=# \c woodpecker_test
woodpecker_test=# \d log_entries
                             Table "public.log_entries"
 Column  |  Type   | Collation | Nullable |                 Default
---------+---------+-----------+----------+-----------------------------------------
 id      | bigint  |           | not null | nextval('log_entries_id_seq'::regclass)
 step_id | bigint  |           |          |
 time    | bigint  |           |          |
 line    | integer |           |          |
 data    | bytea   |           |          |
 created | bigint  |           |          |
 type    | integer |           |          |
Indexes:
    "log_entries_pkey" PRIMARY KEY, btree (id)
    "IDX_log_entries_step_id" btree (step_id)
woodpecker_test=# select data from log_entries where id=5;
                data
------------------------------------
 \x2b20676974206c66732066657463680a
(1 row)
woodpecker_test=# select convert_from(data, 'UTF-8') from log_entries where id=5;
  convert_from
-----------------
 + git lfs fetch+

(1 row)

Shouldn't base64-encoding the log data solve this issue?

Yes, I think.

zc-devs avatar Apr 25 '24 11:04 zc-devs

Thanks for the help.

Just had the feeling as we are getting the logs base64 encoded from the api 🤔

https://github.com/woodpecker-ci/woodpecker/blob/d0057736f164a8a12e671ff3c39ac2c33b80a7f6/web/src/components/repo/pipeline/PipelineLog.vue#L298

https://github.com/woodpecker-ci/woodpecker/blob/d0057736f164a8a12e671ff3c39ac2c33b80a7f6/web/src/components/repo/pipeline/PipelineLog.vue#L303

Anyways I guess the solution should be quite simple by base64 encoding in the agent code directly

anbraten avatar Apr 25 '24 11:04 anbraten

base64 encoding in the agent code

Should it be decoded after receiving / before storing in DB?

Edit 1:

If we do not want to store Base64 in DB, then I'm for solution mentioned by @fernandrone and @eliasscosta:

edit the .proto and change the string fields to bytes

zc-devs avatar Apr 25 '24 13:04 zc-devs

While doing some testing in #2072 I always got the grpc error: log(): code: Internal: rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8 error when I was sending extremely long log lines.

anbraten avatar Apr 26 '24 23:04 anbraten

While doing some testing I always got the error when I was sending extremely long log lines

Seems, I cannot reproduce this.

But I've found other issues(?). Pipeline with 13MB "one"-line log:

skip_clone: true
steps:
  logs:
    image: alpine
    commands:
      - apk add openssl
      - openssl rand 10000000 | base64 -w 0
      - echo Done

Run number 182: I run pipeline. Do not remember status. Server pod restarted with panic. After that I see full log (414 lines). woodpecker-server.log woodpecker-agent.log

Run number 184: I run pipeline. Step stayed in not started state, therefore no logs. When I saw pipeline is finished in K3s logs, I refreshed web page: success, full log (415 lines).

Run number 185: I run pipeline. Step showed up in running state. When I saw in OpenLens, that some pod (Server or Agent) flashed orange, but there was no pod restarts. In UI there were 274 lines of truncated log. After refreshing the page - the same.

woodpecker-agent-2.log

Run number 187: Pipeline started with step in status running. Then it moved to success. In UI there were 290 lines of truncated logs. After refreshing the page - 289 lines. woodpecker-server.log woodpecker-agent.log

Run number 188: Right after 187. I run pipeline, then went to pipelines list page. Server pod restarted, Reason: Error - exit code: 2. Agent did not, but log is truncated (maybe some go selfhealing). After Server restarted, I see pipeline and step in success state. There is full log with 415 lines. woodpecker-server.log woodpecker-agent.log

Woodpecker 2.4.1, K3s 1.29.4.

zc-devs avatar Apr 27 '24 06:04 zc-devs