kuma icon indicating copy to clipboard operation
kuma copied to clipboard

MeshAccessLog - add support for timeout and retry for TCP backend

Open Constantin07 opened this issue 1 year ago • 19 comments

Description

Hi, When sending Kuma logs via DataDog service in K8s we get sporadically error messages from kuma-sidekar with failures to send logs. It would be nice to add support for configuring timeout & nr. of retries in case of TCP connection failure.

Constantin07 avatar Nov 13 '23 18:11 Constantin07

Here is an example of exception raised:

2023-11-10T14:47:39.554Z	ERROR	access-log-streamer	could not send the log to TCP log destination. Dropping the log	{"address": "datadog-kuma-tcp.datadog.svc.cluster.local:10518", "address": "datadog-kuma-tcp.datadog.svc.cluster.local:10518", "error": "failed to send a log entry to a TCP logging backend: datadog-kuma-tcp.datadog.svc.cluster.local:10518: write tcp 10.255.43.220:44174->172.20.167.212:10518: write: connection timed out", "errorVerbose": "write tcp 10.255.43.220:44174->172.20.167.212:10518: write: connection timed out
failed to send a log entry to a TCP logging backend: datadog-kuma-tcp.datadog.svc.cluster.local:10518
github.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/accesslogs.(*logSender).send
\tgithub.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/accesslogs/sender.go:33
github.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/accesslogs.(*accessLogStreamer).streamAccessLogs
\tgithub.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/accesslogs/streamer.go:163
github.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/accesslogs.(*accessLogStreamer).Start.func2
\tgithub.com/kumahq/kuma/app/kuma-dp/pkg/dataplane/accesslogs/streamer.go:72
runtime.goexit
\truntime/asm_amd64.s:1650"}

cbugneac-nex avatar Nov 16 '23 10:11 cbugneac-nex

Triage: did you consider using OpenTelemetry? It might support this already (on the agent side)

jakubdyszkiewicz avatar Nov 20 '23 15:11 jakubdyszkiewicz

@jakubdyszkiewicz but opentelemetry are for metrics, and accessLogStreamer is for logs, isnt it?

samm-git avatar Nov 27 '23 08:11 samm-git

OpenTelemetry can also be used for logs.

Here we support this in MeshAccessLog https://kuma.io/docs/2.5.x/policies/meshaccesslog/#opentelemetry

And it seems that Datadog agent now has a builtin support for it https://www.datadoghq.com/blog/agent-otlp-log-ingestion/

jakubdyszkiewicz avatar Nov 27 '23 09:11 jakubdyszkiewicz

@jakubdyszkiewicz thank you for the hint. I will check proposed configuration and will let you know if it works. I was using datadog tcp collector as i think it was recommended in the docs, but i would explore solution proposed above.

samm-git avatar Nov 27 '23 09:11 samm-git

Thanks @jakubdyszkiewicz for the suggestion, not sure how OpenTelemetry would help here. Does the OpenTelemetry client have support for retry/timeout settings ? Reason being is that if the backend on DataDog is not ready (temporarily) for any reason to accept connections - it would fail similar to TCP backend.

cbugneac-nex avatar Nov 27 '23 09:11 cbugneac-nex

@jakubdyszkiewicz i see that if oltp logs are specified kuma just configures envoy to send the logs in openmetrics format. But i did not found any retry logic/settings/documentation in envoy about it. Do you know any details?

samm-git avatar Nov 27 '23 11:11 samm-git

@jakubdyszkiewicz I was able to get OTLP logs working, however, it also has own problems:

  1. It is not documented that openmetric logs are using GRPC, was able to find it from the source code and rendered envoy config. This should be mentioned in the doc, as OTLP supports both.
  2. Field openTelemetry.attributes fields do not accept %KUMA_*% placeholders. Moreover, it renders bad config for envoy, so it stops accepting any changes.
  3. openTelemetry.body field accepts %KUMA_*% placeholders, but it does escaping automatically, so json is getting escaped and not parsed by DD correctly
  4. I also tried to put to openTelemetry.body object instead of string (just using test: "value" but seems that such config is silently ignored by Kuma and it reverts to default text line

Could you please help with that?

samm-git avatar Nov 28 '23 13:11 samm-git

Actually https://github.com/kumahq/kuma/blob/master/pkg/plugins/policies/meshaccesslog/plugin/xds/configurer.go#L313 missing interpolations

samm-git avatar Nov 28 '23 13:11 samm-git

  1. openTelemetry.body field accepts %KUMA_*% placeholders, but it does escaping automatically, so json is getting escaped and not parsed by DD correctly

@samm-git could you copy/paste an example of what you mean?

michaelbeaumont avatar Nov 28 '23 13:11 michaelbeaumont

@michaelbeaumont sure.

openTelemetry:
  body: '{"test":"value"}'

would render as escaped

"{\"test\": \"value\"}"
openTelemetry:
  body:
    test: "value"

gets ignored and default is in use (why?)

openTelemetry:
  body:
    - test: value

Actually gets rendered to the array

[{"test":"value"}]

what again prevents datadog from creating an index.

samm-git avatar Nov 28 '23 13:11 samm-git

Also, i did an experiment - i tried to kill Datadog daemonset to see how errors are handled.

From what i see - envoy silently dropped all messages when datadog agent was down. No errors were reported, nothing. So it's even worse compared to the original behavior, at least we had some reference that data is incomplete. @jakubdyszkiewicz ^^

samm-git avatar Nov 28 '23 14:11 samm-git

From what i see in Envoy rendered config - MeshAccessLogs does not configure any retry settings

{
  "log_name": ...,
  "grpc_service": {...},
  "transport_api_version": ...,
  "buffer_flush_interval": {...},
  "buffer_size_bytes": {...},
  "filter_state_objects_to_log": [],
  "grpc_stream_retry_policy": {...},
  "custom_tags": []
}

Most of the things related to retry and log flush are configured via buffer_size_bytes and grpc_stream_retry_policy. As we have only default values which are very small, logs likely will be lost.

samm-git avatar Nov 28 '23 14:11 samm-git

Hey, I did some investigation on the "body" and our docs are lacking, or rather examples.

Our docs says

Body is a raw string or an OTLP any value as described at https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/logs/data-model.md#field-body It can contain placeholders available on https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/accesslog/usage#command-operators_

This body field is field from OTEL and it's quite complex. If you want to do key value you can do this

apiVersion: kuma.io/v1alpha1
kind: MeshAccessLog
metadata:
  name: default
  namespace: kuma-system
spec:
  targetRef:
    kind: Mesh
  from:
    - targetRef:
        kind: Mesh
      default:
        backends:
          - type: OpenTelemetry
            openTelemetry:
              endpoint: otel-collector.observability.svc:4317
              body:
                kvlistValue:
                  values:
                  - key: "mesh"
                    value:
                      stringValue: "%KUMA_MESH%"

KUMA_MESH is then interpolated just fine.

We should also do a better job at failing when the format is not right instead of fallback to a default. I agree that it's confusing.

When it comes to retries. I think grpc_stream_retry_policy could help. Let my try to compose MeshProxyPatch to configure this. If this becomes useful we could incorporate this into MeshAccessLog policy

jakubdyszkiewicz avatar Nov 29 '23 15:11 jakubdyszkiewicz

@jakubdyszkiewicz thank you. May i kindly ask to add interpolation to headers as well? As in fact, we don't need a body at all, these all could be injected as OTLP attributes. Plus it is very confusing to have different rules for headers and for the body

samm-git avatar Nov 29 '23 17:11 samm-git

@jakubdyszkiewicz JFYI - example

      default:
        backends:
          - type: OpenTelemetry
            openTelemetry:
              endpoint: otel-collector.observability.svc:4317
              body:
                kvlistValue:
                  values:
                  - key: "mesh"
                    value:
                      stringValue: "%KUMA_MESH%"

works as expected and DataDog is happy with it and indexing the mesh field. I think it would be good to update the documentation with it as it is very confusing atm.

samm-git avatar Nov 29 '23 17:11 samm-git

Ok, so to test retry you can check this proxypatch

apiVersion: kuma.io/v1alpha1
kind: MeshProxyPatch
metadata:
  name: custom-template-1
  namespace: kuma-system
spec:
  targetRef:
    kind: Mesh # use MeshService or MeshSubset if you want to apply this only on specific proxies
  default:
    appendModifications:
      - networkFilter:
          operation: Patch
          match:
            name: envoy.filters.network.http_connection_manager
            # origin: inbound # optional to if you want to add this on incoming listeners
            # listenerName: "inbound:10.42.0.9:3001" # optional if you want to add this to specific listener
          jsonPatches:
            - op: add
              path: /accessLog/0/typedConfig/commonConfig/grpcStreamRetryPolicy
              value:
                numRetries: 5
                retryBackOff:
                  baseInterval: "1s"
                  maxInterval: "10s"

I agree on docs. Created issue https://github.com/kumahq/kuma-website/issues/1538 Interpolation of attributes makes sense https://github.com/kumahq/kuma/issues/8492

Let me know how this retry goes.

jakubdyszkiewicz avatar Nov 30 '23 09:11 jakubdyszkiewicz

I added above example to OpenAPI schema of MeshAccessLog policy: https://github.com/kumahq/kuma/pull/8533/files#diff-ed8e7e0fddb480be4d601ae35bb953cddd53659fb3b9d306b43996a8f322cb38R82

bartsmykla avatar Dec 05 '23 06:12 bartsmykla

This issue was inactive for 90 days. It will be reviewed in the next triage meeting and might be closed. If you think this issue is still relevant, please comment on it or attend the next triage meeting.

github-actions[bot] avatar Jun 24 '24 07:06 github-actions[bot]