kuma
kuma copied to clipboard
MeshAccessLog - add support for timeout and retry for TCP backend
Description
Hi,
When sending Kuma logs via DataDog service in K8s we get sporadically error messages from kuma-sidekar
with failures to send logs.
It would be nice to add support for configuring timeout & nr. of retries in case of TCP connection failure.
Here is an example of exception raised:
2023-11-10T14:47:39.554Z ERROR access-log-streamer could not send the log to TCP log destination. Dropping the log {"address": "datadog-kuma-tcp.datadog.svc.cluster.local:10518", "address": "datadog-kuma-tcp.datadog.svc.cluster.local:10518", "error": "failed to send a log entry to a TCP logging backend: datadog-kuma-tcp.datadog.svc.cluster.local:10518: write tcp 10.255.43.220:44174->172.20.167.212:10518: write: connection timed out", "errorVerbose": "write tcp 10.255.43.220:44174->172.20.167.212:10518: write: connection timed out
failed to send a log entry to a TCP logging backend: datadog-kuma-tcp.datadog.svc.cluster.local:10518
github.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/accesslogs.(*logSender).send
\tgithub.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/accesslogs/sender.go:33
github.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/accesslogs.(*accessLogStreamer).streamAccessLogs
\tgithub.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/accesslogs/streamer.go:163
github.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/accesslogs.(*accessLogStreamer).Start.func2
\tgithub.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/accesslogs/streamer.go:72
runtime.goexit
\truntime/asm_amd64.s:1650"}
Triage: did you consider using OpenTelemetry? It might support this already (on the agent side)
@jakubdyszkiewicz but opentelemetry are for metrics, and accessLogStreamer is for logs, isnt it?
OpenTelemetry can also be used for logs.
Here we support this in MeshAccessLog https://kuma.io/docs/2.5.x/policies/meshaccesslog/#opentelemetry
And it seems that Datadog agent now has a builtin support for it https://www.datadoghq.com/blog/agent-otlp-log-ingestion/
@jakubdyszkiewicz thank you for the hint. I will check proposed configuration and will let you know if it works. I was using datadog tcp collector as i think it was recommended in the docs, but i would explore solution proposed above.
Thanks @jakubdyszkiewicz for the suggestion, not sure how OpenTelemetry would help here. Does the OpenTelemetry client have support for retry/timeout settings ? Reason being is that if the backend on DataDog is not ready (temporarily) for any reason to accept connections - it would fail similar to TCP backend.
@jakubdyszkiewicz i see that if oltp logs are specified kuma
just configures envoy to send the logs in openmetrics format. But i did not found any retry logic/settings/documentation in envoy about it. Do you know any details?
@jakubdyszkiewicz I was able to get OTLP logs working, however, it also has own problems:
- It is not documented that openmetric logs are using GRPC, was able to find it from the source code and rendered envoy config. This should be mentioned in the doc, as OTLP supports both.
- Field
openTelemetry.attributes
fields do not accept%KUMA_*%
placeholders. Moreover, it renders bad config for envoy, so it stops accepting any changes. -
openTelemetry.body
field accepts%KUMA_*%
placeholders, but it does escaping automatically, so json is getting escaped and not parsed by DD correctly - I also tried to put to
openTelemetry.body
object instead of string (just usingtest: "value"
but seems that such config is silently ignored by Kuma and it reverts to default text line
Could you please help with that?
Actually https://github.com/kumahq/kuma/blob/master/pkg/plugins/policies/meshaccesslog/plugin/xds/configurer.go#L313 missing interpolations
- openTelemetry.body field accepts %KUMA_*% placeholders, but it does escaping automatically, so json is getting escaped and not parsed by DD correctly
@samm-git could you copy/paste an example of what you mean?
@michaelbeaumont sure.
openTelemetry:
body: '{"test":"value"}'
would render as escaped
"{\"test\": \"value\"}"
openTelemetry:
body:
test: "value"
gets ignored and default is in use (why?)
openTelemetry:
body:
- test: value
Actually gets rendered to the array
[{"test":"value"}]
what again prevents datadog from creating an index.
Also, i did an experiment - i tried to kill Datadog daemonset to see how errors are handled.
From what i see - envoy silently dropped all messages when datadog agent was down. No errors were reported, nothing. So it's even worse compared to the original behavior, at least we had some reference that data is incomplete. @jakubdyszkiewicz ^^
From what i see in Envoy rendered config - MeshAccessLogs does not configure any retry settings
{
"log_name": ...,
"grpc_service": {...},
"transport_api_version": ...,
"buffer_flush_interval": {...},
"buffer_size_bytes": {...},
"filter_state_objects_to_log": [],
"grpc_stream_retry_policy": {...},
"custom_tags": []
}
Most of the things related to retry and log flush are configured via buffer_size_bytes
and grpc_stream_retry_policy
. As we have only default values which are very small, logs likely will be lost.
Hey, I did some investigation on the "body" and our docs are lacking, or rather examples.
Our docs says
Body is a raw string or an OTLP any value as described at https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/data-model.md#field-body It can contain placeholders available on https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/accesslog/usage#command-operators_
This body field is field from OTEL and it's quite complex. If you want to do key value you can do this
apiVersion: kuma.io/v1alpha1
kind: MeshAccessLog
metadata:
name: default
namespace: kuma-system
spec:
targetRef:
kind: Mesh
from:
- targetRef:
kind: Mesh
default:
backends:
- type: OpenTelemetry
openTelemetry:
endpoint: otel-collector.observability.svc:4317
body:
kvlistValue:
values:
- key: "mesh"
value:
stringValue: "%KUMA_MESH%"
KUMA_MESH is then interpolated just fine.
We should also do a better job at failing when the format is not right instead of fallback to a default. I agree that it's confusing.
When it comes to retries. I think grpc_stream_retry_policy
could help. Let my try to compose MeshProxyPatch to configure this. If this becomes useful we could incorporate this into MeshAccessLog policy
@jakubdyszkiewicz thank you. May i kindly ask to add interpolation to headers as well? As in fact, we don't need a body at all, these all could be injected as OTLP attributes. Plus it is very confusing to have different rules for headers and for the body
@jakubdyszkiewicz JFYI - example
default:
backends:
- type: OpenTelemetry
openTelemetry:
endpoint: otel-collector.observability.svc:4317
body:
kvlistValue:
values:
- key: "mesh"
value:
stringValue: "%KUMA_MESH%"
works as expected and DataDog is happy with it and indexing the mesh
field. I think it would be good to update the documentation with it as it is very confusing atm.
Ok, so to test retry you can check this proxypatch
apiVersion: kuma.io/v1alpha1
kind: MeshProxyPatch
metadata:
name: custom-template-1
namespace: kuma-system
spec:
targetRef:
kind: Mesh # use MeshService or MeshSubset if you want to apply this only on specific proxies
default:
appendModifications:
- networkFilter:
operation: Patch
match:
name: envoy.filters.network.http_connection_manager
# origin: inbound # optional to if you want to add this on incoming listeners
# listenerName: "inbound:10.42.0.9:3001" # optional if you want to add this to specific listener
jsonPatches:
- op: add
path: /accessLog/0/typedConfig/commonConfig/grpcStreamRetryPolicy
value:
numRetries: 5
retryBackOff:
baseInterval: "1s"
maxInterval: "10s"
I agree on docs. Created issue https://github.com/kumahq/kuma-website/issues/1538 Interpolation of attributes makes sense https://github.com/kumahq/kuma/issues/8492
Let me know how this retry goes.
I added above example to OpenAPI schema of MeshAccessLog policy: https://github.com/kumahq/kuma/pull/8533/files#diff-ed8e7e0fddb480be4d601ae35bb953cddd53659fb3b9d306b43996a8f322cb38R82
This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed. If you think this issue is still relevant, please comment on it or attend the next triage meeting.