
support sending logs in Loki format

Open flokli opened this issue 3 years ago • 14 comments

Feature Request

Right now, Talos supports sending logs as json lines via UDP and TCP:

https://www.talos.dev/docs/v0.14/guides/logging/#sending-logs
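For context, a receiver for the JSON lines transport has to split the incoming byte stream on newlines and cope with messages that arrive split across reads. The following is a minimal Python sketch of that framing, not Talos code; the class name and the `msg` field used in the demo are assumptions for illustration only.

```python
import json


class JsonLinesDecoder:
    """Incrementally decode newline-delimited JSON from arbitrary byte chunks."""

    def __init__(self):
        # Bytes received after the last complete line.
        self._pending = b""

    def feed(self, chunk: bytes) -> list:
        """Add a chunk and return every message that it completes."""
        self._pending += chunk
        # Everything before the final newline is complete; the tail is kept.
        *complete, self._pending = self._pending.split(b"\n")
        return [json.loads(line) for line in complete if line.strip()]


# Demo: the second message is split across two reads, as can happen on TCP.
decoder = JsonLinesDecoder()
first = decoder.feed(b'{"msg": "hello"}\n{"msg": "wor')
second = decoder.feed(b'ld"}\n')
```

With UDP each datagram would typically be fed on its own, while with TCP the decoder instance has to live as long as the connection.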

Would it be possible to also add native support for pushing to a Loki endpoint, as described in https://grafana.com/docs/loki/latest/api/#post-lokiapiv1push?

flokli avatar Jan 26 '22 18:01 flokli

We certainly don't want to implement every push API in Talos itself, and since collecting logs usually also means collecting pod logs, running something between Talos and Loki is more appropriate.

smira avatar Jan 26 '22 18:01 smira

I understand you don't want to implement every log push method on the planet.

However, JSON lines over TCP or UDP provide no encryption or authentication, which requires a "trusted network" in the datacenter.

The Loki format is quite simple: it essentially lets you send JSON over HTTP(S).

You can get fancier and use the Go client libraries to send snappy-compressed protobufs, but Loki-compatible JSON over HTTPS should have a pretty low footprint.
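To illustrate how small the JSON variant is, here is a hedged Python sketch that builds a request body for the push endpoint linked above. It follows the documented shape (a `streams` list, each entry holding a `stream` label map and `values` pairs of nanosecond timestamp string and log line). The function name and the `job` label are assumptions for the example, and no request is actually sent.

```python
import json
import time


def build_loki_push_body(labels: dict, lines: list, now_ns=None) -> str:
    """Return a JSON body for Loki's push API carrying one stream."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Loki expects each entry as [<unix epoch in ns, as a string>, <line>].
    # Adding the index keeps the original order when lines share a timestamp.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return json.dumps({"streams": [{"stream": labels, "values": values}]})


# Demo with a fixed timestamp so the result is reproducible.
body = build_loki_push_body({"job": "talos"}, ["first line", "second line"], now_ns=1000)
```

The body would be sent as an HTTP POST to `/loki/api/v1/push` with `Content-Type: application/json`; TLS and authentication would be handled by the HTTP client.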

flokli avatar Jan 26 '22 21:01 flokli

The idea was to push to a cluster-local forwarder, for example Filebeat, which then does the proper forwarding.

smira avatar Jan 26 '22 22:01 smira

Please don't treat my answer as 'no, we won't ever implement that', but supporting Loki directly raises another group of questions - what the field names should be, etc. With software like Filebeat one can easily map the fields to match the desired structure, which is not possible with Talos itself.

smira avatar Jan 27 '22 12:01 smira

The "local" part in Filebeat and the host networking bits were not clear to me. I opened https://github.com/talos-systems/talos/pull/4893 to clarify this, PTAL.

Note that Filebeat doesn't seem to support forwarding to Loki endpoints natively, so Fluent Bit might be the better recommendation here: it seems to support many more output formats and to have a smaller footprint. I haven't played with Fluent Bit in this context yet, so no example config, sorry :laughing:

Please don't treat my answer as 'no, we won't ever implement that', but supporting Loki directly adds another group of questions - what should be the field names, etc.

Agreed - field names should probably follow some best practices. Happy to give some input once I've looked at the logs that Talos is currently sending.

flokli avatar Jan 27 '22 23:01 flokli

@flokli I believe @rgl has some examples of using Talos logging to forward to Loki here: https://github.com/rgl/talos-vagrant

frezbo avatar Jan 28 '22 08:01 frezbo

Yeah, I looked around a bit, and stumbled over https://github.com/rgl/talos-vagrant/commit/c771d754c43ca954f43657d2c65d1ff77be1b1ff.

This seems to run a separate Docker container with Vector to receive and forward logs (see config), but I assume this could be converted to an in-Talos DaemonSet quite trivially.

I'll circle back to it eventually, thanks for the pointer!

flokli avatar Jan 30 '22 16:01 flokli

@smira, the DaemonSet approach is quite interesting. I suppose it will not work when kubernetes is not yet up (or if it fails to go up)?

rgl avatar Jan 30 '22 22:01 rgl

In the case of sending logs via UDP, those will be lost.

Without reading too much of internal/app/machined/pkg/runtime/logging, I assume there's some ring buffer (capped to 1M), so buffering TCP could be a thing - but I'm not sure whether messages are buffered in it while it's still trying to establish the connection, or what the retry behavior looks like.

flokli avatar Jan 31 '22 07:01 flokli

@smira, the DaemonSet approach is quite interesting. I suppose it will not work when kubernetes is not yet up (or if it fails to go up)?

Yes, the DaemonSet approach requires Kubernetes to be up.

One more option is to use static pods (#4727).

smira avatar Feb 02 '22 15:02 smira

In the case of sending logs via UDP, those will be lost.

Without reading too much of internal/app/machined/pkg/runtime/logging, I assume there's some ring buffer (capped to 1M), so buffering TCP could be a thing - but I'm not sure whether messages are buffered in it while it's still trying to establish the connection, or what the retry behavior looks like.

Basically Talos will try to buffer as much as it can until the endpoint is up.

For logging to be reliable, the logging endpoint should certainly be outside of the cluster. That might not always be possible, so there's some compromise.

Talos will retry forever to send the logs, but if more than 1 MiB of logs accumulates, some older logs will be dropped.
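The "buffer up to a cap, drop the oldest" behavior described above can be sketched as follows. This is an illustrative Python model under that description only, not the actual Talos implementation (which is written in Go); the class and method names are hypothetical.

```python
from collections import deque


class BoundedLogBuffer:
    """Hold pending log lines up to max_bytes, evicting the oldest first."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self._lines = deque()
        self._size = 0  # total bytes currently buffered

    def append(self, line: bytes) -> None:
        self._lines.append(line)
        self._size += len(line)
        # Evict from the oldest end until the cap is respected again.
        # A single line larger than the cap ends up evicted as well.
        while self._size > self.max_bytes and self._lines:
            self._size -= len(self._lines.popleft())

    def drain(self) -> list:
        """Return all buffered lines, oldest first, and empty the buffer."""
        lines = list(self._lines)
        self._lines.clear()
        self._size = 0
        return lines


# Demo with a tiny cap: the third append pushes the oldest line out.
buf = BoundedLogBuffer(max_bytes=10)
for entry in (b"aaaa", b"bbbb", b"cccc"):
    buf.append(entry)
remaining = buf.drain()
```

A sender built on this would call `drain()` once the endpoint becomes reachable and keep retrying otherwise; with a cap of `1024 * 1024` it would match the 1 MiB figure mentioned above.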

smira avatar Feb 02 '22 15:02 smira

I circled back to that, and set up a Deployment with Vector running inside the cluster.

I configured machine.logging.destinations[] accordingly:

destinations:
 - endpoint: tcp://vector-headless.talos-logforward.svc:6051
   format: json_lines

I also configured kernel logs (through machine.install.extraKernelArgs[]):

extraKernelArgs:
 - "talos.logging.kernel=tcp://vector-headless.talos-logforward.svc:6050/"

While logs sent to the first destination seem to get ingested properly, I'm not receiving any kernel logs, even when applying the machine config with apply-config -m reboot.

Is there some reason DNS resolution works differently for the kernel logs?

flokli avatar Jun 30 '22 02:06 flokli

The kernel logs only start working after you do an upgrade (upgrading to the same version of Talos also works). Talos appends extra kernel args only on upgrades.

frezbo avatar Jun 30 '22 09:06 frezbo

Wow, that was surprising. I opened https://github.com/siderolabs/talos/pull/5845 to make it clearer.

flokli avatar Jul 01 '22 07:07 flokli