# Vector randomly stops shipping certain k8s logs
### A note for the community

- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
### Problem

After a while, our Vector DaemonSet will randomly stop shipping logs for certain services (other pods keep shipping logs).
### Configuration

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: vector
  namespace: observability
  labels:
    app.kubernetes.io/name: vector
    app.kubernetes.io/instance: vector
    app.kubernetes.io/component: Agent
    app.kubernetes.io/version: "0.20.0-distroless-libc"
data:
  agent.yaml: |
    data_dir: /vector-data-dir
    api:
      enabled: true
      address: 127.0.0.1:8686
      playground: false
    sources:
      kubernetes_logs:
        type: kubernetes_logs
      host_metrics:
        filesystem:
          devices:
            excludes: [binfmt_misc]
          filesystems:
            excludes: [binfmt_misc]
          mountPoints:
            excludes: ["*/proc/sys/fs/binfmt_misc"]
        type: host_metrics
      internal_metrics:
        type: internal_metrics
    transforms:
      setlevel:
        type: remap
        inputs: [kubernetes_logs]
        source: |-
          .temp = parse_json!(.message)
          if !exists(parse_json!(.message).level) {
            .level = "other"
          } else {
            .level = .temp.level
          }
          if exists(.temp.uri) {
            .uri = .temp.uri
          }
          if exists(.temp.msg) {
            .msg = .temp.msg
          }
          if exists(.temp.lat) {
            .lat = .temp.lat
            .lon = .temp.lon
          }
          del(.temp)
    sinks:
      prom_exporter:
        type: prometheus_exporter
        inputs: [host_metrics, internal_metrics]
        address: 0.0.0.0:9090
      # stdout:
      #   type: console
      #   inputs: [setlevel]
      #   encoding:
      #     codec: json
      loki:
        type: loki
        inputs:
          - "setlevel"
        endpoint: https://logs-prod-us-central1.grafana.net
        compression: gzip
        # remove_label_fields: true
        encoding:
          codec: json
        auth:
          password: ${LOKI_PASSWORD}
          user: "${LOKI_USERNAME}"
          strategy: basic
        labels:
          namespace: "{{ kubernetes.pod_namespace }}"
          pod: "{{ kubernetes.pod_name }}"
          level: "{{ level }}"
          app_label: "{{ kubernetes.pod_label.app }}"
          node: "{{ kubernetes.pod_node_name }}"
          pod_owner: "{{ kubernetes.pod_owner }}"
          cluster: ${CLUSTER_NAME}
```
### Version

`timberio/vector:0.20.0-distroless-libc`
### Debug Output

Working on trying to get relevant debugging output; it currently emits a lot of TRACE logs :P
### Example Data

```json
{"func":"github.com/xxx/xxx/crdb.ConnectToDB","level":"debug","msg":"Connected to CRDB","time":"2022-03-29T17:43:21.015Z"}
```
### Additional Context

The only logs from vector:

```text
2022-03-29T17:35:57.183220Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] has been rate limited 10 times.
2022-03-29T17:35:57.183241Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Mapping failed with event. error="function call error for \"parse_json\" at (8:29): unable to parse json: trailing characters at line 1 column 5" internal_log_rate_secs=30
2022-03-29T17:35:57.200043Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] is being rate limited.
2022-03-29T17:36:32.201827Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] has been rate limited 8 times.
2022-03-29T17:36:32.201877Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Mapping failed with event. error="function call error for \"parse_json\" at (8:29): unable to parse json: trailing characters at line 1 column 5" internal_log_rate_secs=30
2022-03-29T17:36:32.437566Z  WARN transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vector::internal_events::remap: Internal log [Mapping failed with event.] is being rate limited.
```
### References

Similar to an issue I've had in the past: https://github.com/vectordotdev/vector/discussions/8634

I will add that restarting Vector seems to solve the issue, but that's not a sustainable solution.

It's also hard to get debug logs because this is not something I can reproduce quickly; it's usually something we notice after 3-10 days, when we try to look at logs and there are none.
Hi @danthegoodman1!

It seems like some of the incoming events are not JSON. I might suggest augmenting your remap program to log the event that fails to parse as JSON, like this:

```
.temp, err = parse_json(.message)
if err != null {
  log("failed to parse .message as JSON: " + .message)
} else {
  # message successfully parsed as JSON
}
```
As an aside, you are currently parsing `.message` as JSON twice. I'd recommend replacing:

```
if !exists(parse_json!(.message).level) {
```

with:

```
if !exists(.temp.level) {
```
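Putting the two suggestions together, the parse step of `setlevel` might look roughly like this (an untested sketch; the `?? "no message!"` fallback is there because concatenating a non-string value is fallible in VRL):

```
.temp, err = parse_json(.message)
if err != null {
  # Surface the offending raw message instead of failing silently
  log("failed to parse .message as JSON: " + .message ?? "no message!")
  .level = "other"
} else if !exists(.temp.level) {
  .level = "other"
} else {
  .level = .temp.level
}
```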
Thanks for the quick response @jszwedko. Very few of our logs are not JSON (some panics, some startup ones), but the overwhelming majority are JSON logs, and they are the ones not coming through.

Will add the suggestions in the meantime.
@danthegoodman1 ah, I see. The log messages you shared above just indicate JSON parsing failures. Are there more error logs?
No, that is what is so strange @jszwedko; it makes it really hard to debug.

Also getting this, trying to work through why this is:
```text
error[E630]: fallible argument
  ┌─ :3:7
  │
3 │   log("failed to parse .message as JSON: " + .message)
  │       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  │       │
  │       this expression can fail
  │       handle the error before passing it in as an argument
  │
  = see documentation about error handling at https://errors.vrl.dev/#handling
  = see language documentation at https://vrl.dev
```
Will update with the solution when I find it.

Updated to:

```
log("failed to parse .message as JSON: " + .message ?? "no message!")
```
Here are some updated logs; it seems like some of the non-JSON logs are from the pods we don't control in EKS :P
```text
2022-03-29T18:27:59.052628Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: level=debug ts=2022-03-29T18:27:49.577Z caller=cluster.go:329 component=cluster memberlist=\"2022/03/29 18:27:49 [DEBUG] memberlist: Initiating push/pull sync with: 01FYC8DZXMP6ENGHGKM4PBK61Q 10.0.167.205:9094\\n\"" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:27:59.052677Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407 1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
2022-03-29T18:28:12.427827Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407 1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 1 times.
2022-03-29T18:28:12.427844Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:02 http: TLS handshake error from 10.0.167.22:11247: EOF" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:12.427861Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407 1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
2022-03-29T18:28:17.165804Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407 1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 2 times.
2022-03-29T18:28:17.165830Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:17 http: TLS handshake error from 10.0.167.22:60721: EOF" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:19.164409Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: level=debug ts=2022-03-29T18:28:15.586Z caller=cluster.go:329 component=cluster memberlist=\"2022/03/29 18:28:15 [DEBUG] memberlist: Stream connection from=10.0.167.205:34340\\n\"" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:19.164463Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407 1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
2022-03-29T18:28:27.218994Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407 1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 1 times.
2022-03-29T18:28:27.219014Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:22 http: TLS handshake error from 10.0.167.22:27699: EOF" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:37.171126Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:27 http: TLS handshake error from 10.0.167.22:34561: EOF" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:37.171168Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407 1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
2022-03-29T18:28:38.940324Z  INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}:file_server: vector::internal_events::file::source: Found new file to watch. file=/var/log/pods/eth-scanner_eth-scanner-27476308-hbml9_05116106-4a27-49a2-ba6e-c811190e304a/eth-scanner/0.log
2022-03-29T18:28:39.221000Z  INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}:file_server: vector::internal_events::file::source: Stopped watching file. file=/var/log/pods/eth-scanner_eth-scanner-27476306-gkzmk_0a02fc81-f83a-4d3a-925b-37febe1dab62/eth-scanner/0.log
2022-03-29T18:28:47.167084Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407 1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 2 times.
2022-03-29T18:28:47.167106Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:47 http: TLS handshake error from 10.0.167.22:48077: EOF" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:47.234267Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407 1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
2022-03-29T18:28:52.426194Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407 1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] has been rate limited 1 times.
2022-03-29T18:28:52.426218Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: 2022/03/29 18:28:52 http: TLS handshake error from 10.0.167.22:62639: EOF" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:59.187233Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: "failed to parse .message as JSON: level=debug ts=2022-03-29T18:28:49.579Z caller=cluster.go:329 component=cluster memberlist=\"2022/03/29 18:28:49 [DEBUG] memberlist: Initiating push/pull sync with: xxx 10.0.167.205:9094\\n\"" internal_log_rate_secs=1 vrl_position=53
2022-03-29T18:28:59.187428Z  INFO transform{component_kind="transform" component_id=setlevel component_type=remap component_name=setlevel}: vrl_stdlib::log: Internal log ["failed to parse .message as JSON: W0329 18:19:7.170407 1 warnings.go:70[] discovery.k8s.io/v1beta1 EndpointSlice is deprecated in v1.21+, unavailable in v1.25+; use discovery.k8s.io/v1 EndpointSlice"] is being rate limited.
```
This has just started happening again, with no error or warning logs. Anything else I should be looking for or trying? We can't be dropping our production logs.

@jszwedko one thing I am noticing is that we don't get the logs about Vector finding new files. This is easier to reproduce on things that come and go relatively quickly, like cron jobs (we leave the completed pod around so logs can be collected).
One thing to add: we checked the mounted `/var/logs` and the files with the missing logs are there; it seems like the event about the new files doesn't get triggered. We're running on AWS EKS and `/proc/sys/fs/inotify/max_user_watches` is 524288.
@danthegoodman1 If new logs were not "found" in time, maybe try reducing `glob_minimum_cooldown_ms`? It's 60000 ms in the `kubernetes_logs` source, but 1000 ms in the `file` source.
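For reference, the override would go on the source in the agent config; something like this (the 3000 ms value is just an example):

```yaml
sources:
  kubernetes_logs:
    type: kubernetes_logs
    # Scan for new/rotated log files every 3 s instead of the 60 s default.
    glob_minimum_cooldown_ms: 3000
```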
@wtchangdm will try, but it's the case for long-running pods too, where it just stops recording logs for those pods. Those files definitely stay around longer than 60s.
@danthegoodman1 BTW, there is an issue I am also tracking: https://github.com/vectordotdev/vector/issues/8616. Not sure if it's related, though.
Interesting, but we don't get any warnings or errors like that issue does.
@danthegoodman1 could you try with the latest release, 0.21.1? The `kubernetes_logs` source was rewritten there to use the community-supported `kube-rs` crate to interact with Kubernetes, which we expect may have fixed a few issues.

Though, in this case, it does seem like the issue is with file discovery rather than metadata enrichment.
@jszwedko I will say that so far we haven't noticed any issues since we dropped the glob cooldown to 8000 ms, but we also haven't been looking at our logs as much.

Will update now and see how it works.
@jszwedko This did not go well; it was printing this extremely quickly:

```text
2022-04-25T18:40:35.870916Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "pods is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"pods\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
2022-04-25T18:40:35.871396Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "namespaces is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"namespaces\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
2022-04-25T18:40:35.872182Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "pods is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"pods\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
2022-04-25T18:40:35.873367Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "namespaces is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"namespaces\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
2022-04-25T18:40:35.874306Z  WARN vector::kubernetes::reflector: Watcher Stream received an error. Retrying. error=InitialListFailed(Api(ErrorResponse { status: "Failure", message: "pods is forbidden: User \"system:serviceaccount:observability:vector\" cannot list resource \"pods\" in API group \"\" at the cluster scope", reason: "Forbidden", code: 403 }))
```
~Ah ok I needed to update the cluster role tags too :)~ still getting it...
The `ClusterRole` needs to allow for the `list` verb as well now, see here.
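For anyone hitting the same 403s, the relevant RBAC rule would look something like this (resources taken from the errors above; the role name is illustrative and should match your existing manifests):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: vector
rules:
  # The rewritten kubernetes_logs source lists and watches these resources
  # cluster-wide, so both verbs must be granted.
  - apiGroups: [""]
    resources: ["pods", "namespaces"]
    verbs: ["list", "watch"]
```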
yep, just made that change :)
After sending ~50 logs/s I can see the following errors show up, then logs stop shipping. It prints a ton of these error logs. It is able to ship them for about 30 seconds before it stops sending logs:
```text
2022-04-26T13:21:29.462788Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { fields: Object({"file": Bytes(b"/var/log/pods/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3ef46c17c9cd/solanascanner/0.log"), "message": Bytes(b"{\"log\":\"{\\\"func\\\":\\\"main.StoreScannerSlotOffset\\\",\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"Writing slot 131412285 for consumer solanascanner-1\\\",\\\"time\\\":\\\"2022-04-26T13:21:29.450Z\\\"}\\n\",\"stream\":\"stderr\",\"time\":\"2022-04-26T13:21:29.45062706Z\"}"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2022-04-26T13:21:29.462766592Z)}), metadata: EventMetadata { datadog_api_key: None, splunk_hec_token: None, finalizers: EventFinalizers([]), schema_definition: Definition { collection: Collection { known: {}, unknown: None }, meaning: {}, optional: {} } } }) error_code="annotation_failed" error_type="reader_failed" stage="processing"
2022-04-26T13:21:29.497575Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { fields: Object({"file": Bytes(b"/var/log/pods/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3ef46c17c9cd/solanascanner/0.log"), "message": Bytes(b"{\"log\":\"{\\\"func\\\":\\\"main.StoreScannerSlotOffset\\\",\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"Wrote offset in 35.128858ms\\\",\\\"time\\\":\\\"2022-04-26T13:21:29.485Z\\\"}\\n\",\"stream\":\"stderr\",\"time\":\"2022-04-26T13:21:29.485822529Z\"}"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2022-04-26T13:21:29.497555692Z)}), metadata: EventMetadata { datadog_api_key: None, splunk_hec_token: None, finalizers: EventFinalizers([]), schema_definition: Definition { collection: Collection { known: {}, unknown: None }, meaning: {}, optional: {} } } }) error_code="annotation_failed" error_type="reader_failed" stage="processing"
2022-04-26T13:21:29.505901Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { fields: Object({"file": Bytes(b"/var/log/pods/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3ef46c17c9cd/solanascanner/0.log"), "message": Bytes(b"{\"log\":\"{\\\"func\\\":\\\"main.ScanBlock\\\",\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"Got slots in 167.388115ms got slots [131412285], len 1\\\",\\\"time\\\":\\\"2022-04-26T13:21:29.502Z\\\"}\\n\",\"stream\":\"stderr\",\"time\":\"2022-04-26T13:21:29.502718488Z\"}"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2022-04-26T13:21:29.505885264Z)}), metadata: EventMetadata { datadog_api_key: None, splunk_hec_token: None, finalizers: EventFinalizers([]), schema_definition: Definition { collection: Collection { known: {}, unknown: None }, meaning: {}, optional: {} } } }) error_code="annotation_failed" error_type="reader_failed" stage="processing"
2022-04-26T13:21:29.505972Z ERROR source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}: vector::internal_events::kubernetes_logs: Failed to annotate event with pod metadata. event=Log(LogEvent { fields: Object({"file": Bytes(b"/var/log/pods/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3ef46c17c9cd/solanascanner/0.log"), "message": Bytes(b"{\"log\":\"{\\\"func\\\":\\\"main.ScanBlock\\\",\\\"level\\\":\\\"debug\\\",\\\"msg\\\":\\\"Got same top slot 131412285, sleeping for 250ms\\\",\\\"time\\\":\\\"2022-04-26T13:21:29.502Z\\\"}\\n\",\"stream\":\"stderr\",\"time\":\"2022-04-26T13:21:29.502731568Z\"}"), "source_type": Bytes(b"kubernetes_logs"), "timestamp": Timestamp(2022-04-26T13:21:29.505966965Z)}), metadata: EventMetadata { datadog_api_key: None, splunk_hec_token: None, finalizers: EventFinalizers([]), schema_definition: Definition { collection: Collection { known: {}, unknown: None }, meaning: {}, optional: {} } } }) error_code="annotation_failed" error_type="reader_failed" stage="processing"
2022-04-26T13:21:29.576577Z  INFO source{component_kind="source" component_id=kubernetes_logs component_type=kubernetes_logs component_name=kubernetes_logs}:file_server: vector::internal_events::file::source: Stopped watching file. file=/var/log/pods/solana-scanner_solanascanner-1_81d00edf-3189-4a43-867f-3ef46c17c9cd/solanascanner/0.log
```
It does not affect watching other pods, and CPU never exceeds ~20 mCPU and memory never exceeds 33 MB for the Vector pods.
Seems related to https://github.com/vectordotdev/vector/issues/10122#issuecomment-1026006657
Possibly. The other answer would be that `solanascanner` was deleted in the cluster before all of its log lines were processed. When that happens, `delay_deletion_ms` comes into play, as it determines how long we hold onto old metadata before dumping it ourselves. Increasing that value will increase memory usage (however much it takes to hold onto the resource metadata), but allows more time to enrich logs from deleted pods.
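As a sketch, raising that option on the source would look like this (the 5-minute value is just an example; check the `kubernetes_logs` source docs for the default):

```yaml
sources:
  kubernetes_logs:
    type: kubernetes_logs
    # Keep metadata for deleted pods around for 5 minutes so late log
    # lines can still be annotated.
    delay_deletion_ms: 300000
```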
It seems reducing the glob time to 3s is working at this rate. I also reduced the number of logs printed to reduce the frequency of log rotation
Still getting the errors, and the pods are not being removed, logs are coming in much slower now as well
Can you share the setup you're using so I can reproduce?