
Vector randomly stops shipping certain k8s logs

Open danthegoodman1 opened this issue 2 years ago • 67 comments

### A note for the community

  • Please vote on this issue by adding a πŸ‘ reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

### Problem

After a while, our vector daemonset will randomly stop shipping logs for a select service (some other pods will keep shipping logs)

### Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: vector
  namespace: observability
  labels:
    app.kubernetes.io/name: vector
    app.kubernetes.io/instance: vector
    app.kubernetes.io/component: Agent
    app.kubernetes.io/version: "0.20.0-distroless-libc"
data:
  agent.yaml: |
    data_dir: /vector-data-dir
    api:
      enabled: true
      address: 127.0.0.1:8686
      playground: false
    sources:
      kubernetes_logs:
        type: kubernetes_logs
      host_metrics:
        filesystem:
          devices:
            excludes: [binfmt_misc]
          filesystems:
            excludes: [binfmt_misc]
          mountPoints:
            excludes: ["*/proc/sys/fs/binfmt_misc"]
        type: host_metrics
      internal_metrics:
        type: internal_metrics
    transforms:
      setlevel:
        type: remap
        inputs: [kubernetes_logs]
        source: |-
          .temp = parse_json!(.message)
          if !exists(parse_json!(.message).level) {
            .level = "other"
          } else {
            .level = .temp.level
          }
          if exists(.temp.uri) {
            .uri = .temp.uri
          }
          if exists(.temp.msg) {
            .msg = .temp.msg
          }
          if exists(.temp.lat) {
            .lat = .temp.lat
            .lon = .temp.lon
          }
          del(.temp)
    sinks:
      prom_exporter:
        type: prometheus_exporter
        inputs: [host_metrics, internal_metrics]
        address: 0.0.0.0:9090
      # stdout:
      #   type: console
      #   inputs: [setlevel]
      #   encoding:
      #     codec: json
      loki:
        type: loki
        inputs:
          - "setlevel"
        endpoint: https://logs-prod-us-central1.grafana.net
        compression: gzip
        # remove_label_fields: true
        encoding:
          codec: json
        auth:
          password: ${LOKI_PASSWORD}
          user: "${LOKI_USERNAME}"
          strategy: basic
        labels:
          namespace: "{{ kubernetes.pod_namespace }}"
          pod: "{{ kubernetes.pod_name }}"
          level: "{{ level }}"
          app_label: "{{ kubernetes.pod_label.app }}"
          node: "{{ kubernetes.pod_node_name }}"
          pod_owner: "{{ kubernetes.pod_owner }}"
          cluster: ${CLUSTER_NAME}


### Version

timberio/vector:0.20.0-distroless-libc

### Debug Output

```text
Working on trying to get relevant debugging output; it currently sends a lot of TRACE logs :P
```

### Example Data

{"func":"github.com/xxx/xxx/crdb.ConnectToDB","level":"debug","msg":"Connected to CRDB","time":"2022-03-29T17:43:21.015Z"}

### Additional Context

Only logs from vector:

2022-03-29T17:35:57.183220Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] has been rate limited 10 times.
2022-03-29T17:35:57.183241Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Mapping failed with event. error="function call error for \"parse_json\" at (8:29): unable to parse json: trailing characters at line 1 column 5" internal_log_rate_secs=30
2022-03-29T17:35:57.200043Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] is being rate limited.
2022-03-29T17:36:32.201827Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] has been rate limited 8 times.
2022-03-29T17:36:32.201877Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Mapping failed with event. error="function call error for \"parse_json\" at (8:29): unable to parse json: trailing characters at line 1 column 5" internal_log_rate_secs=30
2022-03-29T17:36:32.437566Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] is being rate limited.

### References

Similar to an issue I've had in the past: https://github.com/vectordotdev/vector/discussions/8634

danthegoodman1 avatar Mar 29 '22 17:03 danthegoodman1

I will add that restarting vector seems to solve the issue, but that's not a sustainable solution.

danthegoodman1 avatar Mar 29 '22 17:03 danthegoodman1

It's also hard to get debug logs because this is not something I can reproduce quickly; it's usually something we notice after 3-10 days when we try to look at logs and there are none.

danthegoodman1 avatar Mar 29 '22 17:03 danthegoodman1

Hi @danthegoodman1 !

It seems like some of the incoming events are not JSON. I might suggest augmenting your remap program to log the event that fails to parse as JSON like this:

.temp, err = parse_json(.message)
if err != null {
  log("failed to parse .message as JSON: " + .message)
} else {
  // message successfully parsed as JSON
}

jszwedko avatar Mar 29 '22 18:03 jszwedko

As an aside, you are currently parsing .message as JSON twice. I'd recommend replacing:

if !exists(parse_json!(.message).level) {

with:

if !exists(.temp.level) {

jszwedko avatar Mar 29 '22 18:03 jszwedko
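Putting the two suggestions above together, the setlevel transform source could look roughly like the following untested sketch (the "other" fallback and the copied fields mirror the original config in this issue):

```yaml
transforms:
  setlevel:
    type: remap
    inputs: [kubernetes_logs]
    source: |-
      # parse once, fallibly, so non-JSON lines don't abort the event
      .temp, err = parse_json(.message)
      if err != null {
        # non-JSON lines (panics, startup output, etc.) still get a level
        .level = "other"
      } else {
        if exists(.temp.level) {
          .level = .temp.level
        } else {
          .level = "other"
        }
        if exists(.temp.uri) { .uri = .temp.uri }
        if exists(.temp.msg) { .msg = .temp.msg }
        if exists(.temp.lat) {
          .lat = .temp.lat
          .lon = .temp.lon
        }
      }
      del(.temp)
```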

Thanks for the quick response @jszwedko. Very few of our logs are not JSON (some panics, some startup lines), but the vast majority are JSON logs, and those are the ones not coming through.

Will add the suggestions in the meantime

danthegoodman1 avatar Mar 29 '22 18:03 danthegoodman1

@danthegoodman1 ah, I see. The log messages you shared above just indicate JSON parsing failures. Are there more error logs?

jszwedko avatar Mar 29 '22 18:03 jszwedko

No, that is what is so strange @jszwedko; it makes it really hard to debug 🙃

danthegoodman1 avatar Mar 29 '22 18:03 danthegoodman1

Also getting this; trying to work through why this is:

error[E630]: fallible argument
  ┌─ :3:7
  │
3 │   log("failed to parse .message as JSON: " + .message)
  │       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  │       │
  │       this expression can fail
  │       handle the error before passing it in as an argument
  │
  = see documentation about error handling at https://errors.vrl.dev/#handling
  = see language documentation at https://vrl.dev

Will update with the solution when I find it

danthegoodman1 avatar Mar 29 '22 18:03 danthegoodman1

Updated to

log("failed to parse .message as JSON: " + .message ?? "no mesage!")

danthegoodman1 avatar Mar 29 '22 18:03 danthegoodman1

Here are some updated logs; it seems like the non-JSON logs are from the pods we don't control in EKS :P

2022-03-29T18:27:59.052628Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: level=debug ts=2022-03-29T18:27:49.577Z caller=cluster.go:329 component=cluster memberlist=\"2022/03/29 18:27:49 [DEBUG] memberlist: Initiating push/pull sync with: 01FYC8DZXMP6ENGHGKM4PBK61Q 10.0.167.205:9094\\n\"" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:27:59.052677Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
2022-03-29T18:28:12.427827Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 1 times.
2022-03-29T18:28:12.427844Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:02 http: TLS handshake error from 10.0.167.22:11247: EOF" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:12.427861Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
2022-03-29T18:28:17.165804Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 2 times.
2022-03-29T18:28:17.165830Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:17 http: TLS handshake error from 10.0.167.22:60721: EOF" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:19.164409Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: level=debug ts=2022-03-29T18:28:15.586Z caller=cluster.go:329 component=cluster memberlist=\"2022/03/29 18:28:15 [DEBUG] memberlist: Stream connection from=10.0.167.205:34340\\n\"" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:19.164463Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
2022-03-29T18:28:27.218994Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 1 times.
2022-03-29T18:28:27.219014Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:22 http: TLS handshake error from 10.0.167.22:27699: EOF" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:37.171126Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:27 http: TLS handshake error from 10.0.167.22:34561: EOF" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:37.171168Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
2022-03-29T18:28:38.940324Z  INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}:file_server: vector::internal_events::file::source: Found new file to watch. file=/var/log/pods/eth-scanner_eth-scanner-27476308-hbml9_05116106-4a27-49a2-ba6e-c811190e304a/eth-scanner/0.log
2022-03-29T18:28:39.221000Z  INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}:file_server: vector::internal_events::file::source: Stopped watching file. file=/var/log/pods/eth-scanner_eth-scanner-27476306-gkzmk_0a02fc81-f83a-4d3a-925b-37febe1dab62/eth-scanner/0.log
2022-03-29T18:28:47.167084Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 2 times.
2022-03-29T18:28:47.167106Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:47 http: TLS handshake error from 10.0.167.22:48077: EOF" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:47.234267Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
2022-03-29T18:28:52.426194Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 1 times.
2022-03-29T18:28:52.426218Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:52 http: TLS handshake error from 10.0.167.22:62639: EOF" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:59.187233Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: level=debug ts=2022-03-29T18:28:49.579Z caller=cluster.go:329 component=cluster memberlist=\"2022/03/29 18:28:49 [DEBUG] memberlist: Initiating push/pull sync with: xxx 10.0.167.205:9094\\n\"" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:59.187428Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407       1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.

danthegoodman1 avatar Mar 29 '22 18:03 danthegoodman1

This has just started happening again, with no error or warning logs. Anything else I should be looking for or trying? We can't be dropping our production logs.

danthegoodman1 avatar Mar 30 '22 14:03 danthegoodman1

@jszwedko one thing I am noticing is that we don't get the logs about vector finding new files. This is easier to reproduce on things that come and go relatively quickly, like cron jobs (we leave the completed pod around so logs can be collected).

danthegoodman1 avatar Mar 30 '22 15:03 danthegoodman1

One thing to add: we checked the mounted /var/logs and the files with the missing logs are there; it seems like the event about the new files doesn't get triggered. We're running on AWS EKS and /proc/sys/fs/inotify/max_user_watches is 524288.

Hades32 avatar Mar 30 '22 15:03 Hades32

@danthegoodman1 If new log files are not being "found" in time, maybe try reducing glob_minimum_cooldown_ms?

It defaults to 60000 ms in the kubernetes_logs source, but 1000 ms in the file source.

wtchangdm avatar Apr 11 '22 09:04 wtchangdm
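For reference, a minimal sketch of how that could be set on the source in the agent.yaml above (the value below is only an example; tune it to how quickly your pods and log files come and go):

```yaml
sources:
  kubernetes_logs:
    type: kubernetes_logs
    # defaults to 60000 ms for this source; lower it so newly created or
    # rotated log files are globbed sooner (example value)
    glob_minimum_cooldown_ms: 10000
```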

@wtchangdm will try, but it's the case for long-running pods too, where it just stops recording logs for those pods. Those files definitely stay around longer than 60s.

danthegoodman1 avatar Apr 11 '22 10:04 danthegoodman1

@danthegoodman1 BTW, there is an issue I am also tracking: https://github.com/vectordotdev/vector/issues/8616. Not sure if it's related, though.

wtchangdm avatar Apr 11 '22 10:04 wtchangdm

Interesting, but we don't get any warnings or errors like that issue does.

danthegoodman1 avatar Apr 11 '22 10:04 danthegoodman1

@danthegoodman1 could you try with the latest release, 0.21.1? The kubernetes_logs source was rewritten there to use the community-supported kube-rs crate to interact with Kubernetes, which we expect may have fixed a few issues.

Though, in this case, it does seem like the issue is with file discovery rather than metadata enrichment.

jszwedko avatar Apr 25 '22 18:04 jszwedko
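For anyone following along, upgrading here just means bumping the image tag on the Vector DaemonSet; the manifest isn't shown in this issue, so the fragment below is a hypothetical sketch based on the tag format in the Version section:

```yaml
containers:
  - name: vector
    # bump from 0.20.0 to the release with the rewritten kubernetes_logs source,
    # keeping the same distroless-libc variant already in use
    image: timberio/vector:0.21.1-distroless-libc
```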

@jszwedko I will say that so far we haven't noticed any issues since we dropped the glob cooldown to 8000, but admittedly we also have not been in our logs as much.

Will update now and see how it works

danthegoodman1 avatar Apr 25 '22 18:04 danthegoodman1

@jszwedko This did not go well at all; it was printing this extremely quickly:

2022-04-25T18:40:35.870916Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "pods is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"pods\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
2022-04-25T18:40:35.871396Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "namespaces is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"namespaces\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
2022-04-25T18:40:35.872182Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "pods is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"pods\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
2022-04-25T18:40:35.873367Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "namespaces is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"namespaces\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
2022-04-25T18:40:35.874306Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "pods is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"pods\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))

danthegoodman1 avatar Apr 25 '22 18:04 danthegoodman1

~Ah ok I needed to update the cluster role tags too :)~ still getting it...

danthegoodman1 avatar Apr 25 '22 18:04 danthegoodman1

@jszwedko This did not go well at all; it was printing this extremely quickly:

The ClusterRole needs to allow for the list verb as well now, see here

spencergilbert avatar Apr 25 '22 18:04 spencergilbert
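A sketch of the kind of RBAC rules the 403s above point at, assuming a ClusterRole named vector bound to the system:serviceaccount:observability:vector service account (verify the exact resources and verbs against the chart linked above):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: vector
rules:
  - apiGroups: [""]
    # the errors above show "list" being denied for pods and namespaces;
    # nodes may also be required depending on the chart version
    resources: ["pods", "namespaces", "nodes"]
    verbs: ["get", "list", "watch"]
```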

yep, just made that change :)

danthegoodman1 avatar Apr 25 '22 18:04 danthegoodman1

After sending ~50 logs/s I can see the following errors appear, and then logs stop shipping. It prints a ton of these error logs.

It is able to ship them for about 30 seconds before it stops sending logs.

2022-04-26T13:21:29.462788Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { fields: Object({"file": Bytes(b"/var/log/pods/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3ef46c17c9cd/solanascanner/0.log"), "message": Bytes(b"{\"log\":\"{\\\"func\\\":\\\"main.StoreScannerSlotOffset\\\",\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"Writing slot 131412285 for consumer solanascanner-1\\\",\\\"time\\\":\\\"2022-04-26T13:21:29.450Z\\\"}\\n\",\"stream\":\"stderr\",\"time\":\"2022-04-26T13:21:29.45062706Z\"}"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2022-04-26T13:21:29.462766592Z)}), metadata: EventMetadata { datadog_api_key: None, splunk_hec_token: None, finalizers: EventFinalizers([]), schema_definition: Definition { collection: Collection { known: {}, unknown: None }, meaning: {}, optional: {} } } }) error_code="annotation_failed" error_type="reader_failed" stage="processing"
2022-04-26T13:21:29.497575Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { fields: Object({"file": Bytes(b"/var/log/pods/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3ef46c17c9cd/solanascanner/0.log"), "message": Bytes(b"{\"log\":\"{\\\"func\\\":\\\"main.StoreScannerSlotOffset\\\",\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"Wrote offset in 35.128858ms\\\",\\\"time\\\":\\\"2022-04-26T13:21:29.485Z\\\"}\\n\",\"stream\":\"stderr\",\"time\":\"2022-04-26T13:21:29.485822529Z\"}"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2022-04-26T13:21:29.497555692Z)}), metadata: EventMetadata { datadog_api_key: None, splunk_hec_token: None, finalizers: EventFinalizers([]), schema_definition: Definition { collection: Collection { known: {}, unknown: None }, meaning: {}, optional: {} } } }) error_code="annotation_failed" error_type="reader_failed" stage="processing"
2022-04-26T13:21:29.505901Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { fields: Object({"file": Bytes(b"/var/log/pods/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3ef46c17c9cd/solanascanner/0.log"), "message": Bytes(b"{\"log\":\"{\\\"func\\\":\\\"main.ScanBlock\\\",\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"Got slots in 167.388115ms got slots [131412285], len 1\\\",\\\"time\\\":\\\"2022-04-26T13:21:29.502Z\\\"}\\n\",\"stream\":\"stderr\",\"time\":\"2022-04-26T13:21:29.502718488Z\"}"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2022-04-26T13:21:29.505885264Z)}), metadata: EventMetadata { datadog_api_key: None, splunk_hec_token: None, finalizers: EventFinalizers([]), schema_definition: Definition { collection: Collection { known: {}, unknown: None }, meaning: {}, optional: {} } } }) error_code="annotation_failed" error_type="reader_failed" stage="processing"
2022-04-26T13:21:29.505972Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { fields: Object({"file": Bytes(b"/var/log/pods/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3ef46c17c9cd/solanascanner/0.log"), "message": Bytes(b"{\"log\":\"{\\\"func\\\":\\\"main.ScanBlock\\\",\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"Got same top slot 131412285, sleeping for 250ms\\\",\\\"time\\\":\\\"2022-04-26T13:21:29.502Z\\\"}\\n\",\"stream\":\"stderr\",\"time\":\"2022-04-26T13:21:29.502731568Z\"}"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2022-04-26T13:21:29.505966965Z)}), metadata: EventMetadata { datadog_api_key: None, splunk_hec_token: None, finalizers: EventFinalizers([]), schema_definition: Definition { collection: Collection { known: {}, unknown: None }, meaning: {}, optional: {} } } }) error_code="annotation_failed" error_type="reader_failed" stage="processing"
2022-04-26T13:21:29.576577Z  INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}:file_server: vector::internal_events::file::source: Stopped watching file. file=/var/log/pods/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3ef46c17c9cd/solanascanner/0.log

danthegoodman1 avatar Apr 26 '22 13:04 danthegoodman1

It does not affect watching other pods, and CPU never exceeds ~20 mCPU and MEM never exceeds 33 MB for the vector pods

danthegoodman1 avatar Apr 26 '22 13:04 danthegoodman1

Seems related to https://github.com/vectordotdev/vector/issues/10122#issuecomment-1026006657

danthegoodman1 avatar Apr 26 '22 13:04 danthegoodman1

Seems related to #10122 (comment)

Possibly - the other answer would be that solanascanner was deleted in the cluster before all of its log lines were processed. When that happens, delay_deletion_ms comes into play, as it determines how long we hold onto old metadata before dumping it ourselves. Increasing that value will increase memory usage (by however much it takes to hold onto the resource metadata), but allow for more time to enrich logs from deleted pods.

spencergilbert avatar Apr 26 '22 13:04 spencergilbert
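A sketch of what that could look like on the source (delay_deletion_ms defaults to 60000 ms; the value below is only an example, traded against the extra memory of keeping old pod metadata around):

```yaml
sources:
  kubernetes_logs:
    type: kubernetes_logs
    # keep metadata for deleted pods longer so trailing log lines from
    # completed/removed pods can still be annotated (example value)
    delay_deletion_ms: 120000
```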

It seems reducing the glob cooldown to 3s is working at this rate. I also reduced the number of logs printed to reduce the frequency of log rotation.

danthegoodman1 avatar Apr 26 '22 13:04 danthegoodman1

Still getting the errors, and the pods are not being removed; logs are coming in much slower now as well.

danthegoodman1 avatar Apr 26 '22 13:04 danthegoodman1

Still getting the errors, and the pods are not being removed; logs are coming in much slower now as well.

Can you share the setup you're using so I can reproduce?

spencergilbert avatar Apr 26 '22 13:04 spencergilbert