
Enable Beats stack monitoring configuration

Open naemono opened this issue 3 years ago • 21 comments

Closes #5563

This change enables easy stack monitoring configuration for Beats, as we already have for both Elasticsearch and Kibana, by adding the following configuration stanza to the Beat CRD:

apiVersion: beat.k8s.elastic.co/v1beta1
kind: Beat
spec:
  monitoring:
    metrics:
      elasticsearchRefs:
      - name: elasticsearch
    logs:
      elasticsearchRefs:
      - name: elasticsearch

naemono avatar Jul 18 '22 15:07 naemono

@pebrc I've reached out to Observability, and they also suggested we use sidecars to monitor these instead of internal collectors. My changes to allow sidecars are nearly complete, with some unit tests for the functionality still being worked on, and I'll note here when this is ready for another look.
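For anyone following along, the sidecar approach boils down to running a small Metricbeat container alongside the monitored Beat and pointing its beat module at that Beat's HTTP monitoring endpoint. A minimal sketch of the kind of sidecar configuration involved, assuming the default endpoint address (the config the operator actually generates may differ):

# Metricbeat sidecar: collect stack monitoring data from a co-located Beat.
# The monitored Beat must expose its monitoring endpoint (http.enabled: true).
metricbeat.modules:
- module: beat
  xpack.enabled: true               # ship documents in stack monitoring format
  period: 10s
  hosts: ["http://localhost:5066"]  # default Beats HTTP monitoring endpoint
output.elasticsearch:
  hosts: ["https://monitoring-es-http:9200"]  # illustrative monitoring cluster URL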

naemono avatar Aug 02 '22 21:08 naemono

I left a few comments. In my tests the metrics never showed up in the "Stack Monitoring" view in Kibana. It wasn't clear to me what the root cause was (e.g. I added the missing UUID), but I guess it warrants further investigation.

Yeah, it's not showing up in Stack Monitoring for me either for some reason, but the data is definitely getting into the indices (same for Filebeat, but not shown).

I'll figure out why the Stack Monitoring view in Kibana isn't showing the data.

Also, I'm working through the comments....

naemono avatar Aug 11 '22 18:08 naemono

I believe this is now ready for more 👀 . I'm just verifying the e2e tests at this time.

naemono avatar Aug 16 '22 16:08 naemono

run/e2e-tests tags=beat

naemono avatar Aug 16 '22 21:08 naemono

run/e2e-tests tags=beat

naemono avatar Aug 17 '22 13:08 naemono

run/e2e-tests tags=beat

naemono avatar Aug 17 '22 16:08 naemono

I did a quick test (haven't looked at the code yet), but somehow the UI still does not show content.

(screenshot from 2022-08-18: the Stack Monitoring UI showing no content)

There is data flowing into my monitoring cluster though. Not sure what is going on with the stack monitoring UI. Going to take a closer look tomorrow.

green open .ds-metricbeat-8.3.2-2022.08.18-000001       qnmXL7hNQeKIecbdeJ3yhQ 1 1    99 0 214.2kb 106.4kb
green open .ds-filebeat-8.3.2-2022.08.18-000001         Wwi6mAX0T1Cczfs-dQDc4w 1 1 12178 0   2.8mb   1.3mb
green open .ds-.monitoring-beats-8-mb-2022.08.18-000001 851mItfaS7iiCbYwB5yQ2g 1 1   139 0 511.1kb 248.6kb

pebrc avatar Aug 18 '22 19:08 pebrc

All that is ingested are the sidecars' own logs.

I'm working through this... I'm seeing an odd issue where something is listening on 5066, even though I am disabling the HTTP endpoint in both the filebeat and metricbeat sidecars. I'll update when it's resolved.

Exiting: could not start the HTTP server for the API: listen tcp 127.0.0.1:5066: bind: address already in use

naemono avatar Aug 24 '22 15:08 naemono

metricbeat and filebeat (or any two Beats) running on the same host and listening on 127.0.0.1:5066 seem to be the cause of this... I'm investigating a solution.

metricbeat-beat-metricbeat-4bs56            2/3     CrashLoopBackOff   10 (68s ago)   27m   10.142.0.41   gke-mmontgomery-testcluster01-pool-1-9b30cf4a-9eg2   <none>           <none>
filebeat-beat-filebeat-cx52x                2/2     Running            2 (27m ago)    27m   10.142.0.41   gke-mmontgomery-testcluster01-pool-1-9b30cf4a-9eg2   <none>           <none>

naemono avatar Aug 24 '22 15:08 naemono

metricbeat and filebeat (or any two Beats) running on the same host and listening on 127.0.0.1:5066 seem to be the cause of this... I'm investigating a solution.

As I suspected, this is the culprit. If any two Beats with stack monitoring and hostNetwork: true end up on the same host, this port conflict will occur...

        hostNetwork: true 

I'm thinking through some sort of deterministic solution in which the ports assigned to any Beat with stack monitoring enabled remain consistent across reconciliation cycles.
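To make the collision concrete: with stack monitoring enabled, each monitored Beat exposes the standard HTTP monitoring endpoint with its stock defaults, so two Pods sharing the node's network namespace both try to bind the same address. Roughly (these are the standard Beats http endpoint settings, not anything operator-specific):

# Default HTTP monitoring endpoint settings for any Beat; with hostNetwork: true,
# two Pods on the same node both end up binding 127.0.0.1:5066.
http:
  enabled: true
  host: localhost
  port: 5066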

naemono avatar Aug 26 '22 01:08 naemono

I'm thinking through some sort of deterministic solution in which the ports assigned to any Beat with stack monitoring enabled...

Could you give more details about your solution?

  • If deployed as a daemonset, do we want to use the same port on all the nodes?
  • How do we avoid colliding with ports already open on the node?
  • Are there any security considerations?
  • Any chance of having the Beats team work on a Unix-socket-based interface?

barkbay avatar Aug 31 '22 05:08 barkbay

@barkbay I don't currently have a solution to this problem. I'll document some potential (partial) solutions here that I've been working through, but none of them fully eliminates the potential for port conflicts when using hostNetwork: true; then again, that's true for ANY pod that uses this setting...

Option 1: Ensure a distinct http.port setting for each Beat daemonset with stack monitoring enabled and hostNetwork: true set.

For each Beat that is deployed as a daemonset and has stack monitoring enabled with hostNetwork: true set, we could have a package that returns a distinct http.port, starting at 5066 and incrementing (see the sketch after this list).

  1. We would have to 'save' the assigned stack monitoring port (likely) in the status of the Beat object for future queries.
  2. This wouldn't fully resolve the issue, as a customer could have something running on 5066 on the host already, which would collide with the Beat.
  3. Multiple operator instances running within the same cluster would potentially run into issues, as they wouldn't share this data, unless we persisted it in a shared ConfigMap that multiple operators query.
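A purely hypothetical illustration of what 'saving' the assignment might look like (this field does not exist in the Beat CRD today; the names are illustrative only):

# Hypothetical: the operator records the monitoring endpoint port it assigned
# to this Beat in its status, so the assignment survives reconciliation cycles.
status:
  monitoring:
    metricsPort: 5067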

Option 2: Adjust the monitoring section of the Beat spec to require that the user assign a monitoring port when hostNetwork: true is set.

This solution moves the onus onto the user to ensure that the stack monitoring ports for Beats within the cluster do not collide (a hypothetical example follows the list below).

  1. Again, multiple operators running within the same cluster could run into issues, as they may not have the RBAC rules to query Beats outside of their assigned namespaces for port assignments when validating the Beat object upon creation.
  2. Also has the potential problem where the k8s nodes may have something already running on the node, on the assigned port, that the user is unaware of.
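As a hypothetical example of option 2 (the port field shown here is illustrative and not part of the current CRD), the user would pick a free port per Beat:

apiVersion: beat.k8s.elastic.co/v1beta1
kind: Beat
spec:
  monitoring:
    metrics:
      elasticsearchRefs:
      - name: elasticsearch
      # Hypothetical field: user-chosen monitoring endpoint port, to avoid
      # node-level conflicts when hostNetwork: true is set.
      port: 5067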

Option 3: Document the potential port conflict, and remove hostNetwork: true from recipes for Beat types that do not require it.

There are indications that hostNetwork: true is not strictly required for certain Beat types to function (see issue comment). If this is true, we could remove hostNetwork: true from our recipes/documentation for the Beats that do not require it (metricbeat/packetbeat seem to be the only ones that should 'require' this setting). In this scenario, we could document the potential for port conflicts when hostNetwork is used, and leave it up to the user to avoid them.
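For reference, the lines this would remove from the recipes that don't need them are the host networking settings in the daemonset pod template, roughly (exact recipes vary):

daemonSet:
  podTemplate:
    spec:
      hostNetwork: true                    # only kept for Beats that truly need it, e.g. metricbeat/packetbeat
      dnsPolicy: ClusterFirstWithHostNet   # companion setting typically paired with hostNetwork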

Let me know if you have strong feelings for/against these options. I personally do not like assigning ports starting at 5066 and incrementing, as it seems likely to run into a port conflict sooner or later, and the user would be pretty blind to the reason it's happening. It also seems as though we should pursue removing hostNetwork: true from our recipes for Beats that do not require it, regardless of the final solution we pursue here.

To answer the questions you specifically asked:

  • If deployed as a daemonset, do we want to use the same port on all the nodes?

I think you're asking whether two Beat daemonsets should use the same port. I'd say likely not, if stack monitoring is enabled and hostNetwork: true is set for both of them.

  • How do we avoid colliding with ports already open on the node?

I don't know of any way to prevent this when a server running on the k8s node, outside of k8s, binds to a port.

  • Are there any security considerations?

We are opening port 5066 on the node's host network when stack monitoring is enabled, so any process on the host could query the Beat's metrics; that's a potential security risk depending on the data served by the monitoring endpoint.

  • Any chance of having the Beats team work on a Unix-socket-based interface?

~~I will bring this up with the Beats team to see their thoughts.~~ Unix sockets/Windows named pipes are supported according to the documentation. I'll test this option.

naemono avatar Aug 31 '22 14:08 naemono

@barkbay Unix sockets/Windows named pipes appear to be a valid option; the socket could be named according to namespace/name, which would solve this problem: https://www.elastic.co/guide/en/beats/metricbeat/current/http-endpoint.html

I'll do some testing and see what comes of this...
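A rough sketch of the wiring I have in mind, assuming the socket lives on a volume shared between the monitored Beat and its sidecar, and assuming the beat module accepts the http+unix scheme for unix-socket hosts (paths and names here are illustrative, not the final generated config):

# Monitored Beat: expose the monitoring endpoint on a unix socket instead of a TCP port.
http:
  enabled: true
  host: unix:///var/shared/filebeat-default-filebeat.sock

# Metricbeat sidecar: point the beat module at that socket.
metricbeat.modules:
- module: beat
  xpack.enabled: true
  period: 10s
  hosts: ["http+unix:///var/shared/filebeat-default-filebeat.sock"]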

naemono avatar Aug 31 '22 14:08 naemono

(screenshot: Beats data now appearing in the Stack Monitoring UI)

Beats stack monitoring now works over unix sockets, which solves the port conflicts we were seeing on the host network.

naemono avatar Sep 01 '22 20:09 naemono

run/e2e-tests tags=beat

naemono avatar Sep 01 '22 21:09 naemono

This looks good! I'm testing with this manifest and I get a socket permission error for Filebeat:

{
    "log.level": "error",
    "@timestamp": "2022-09-09T13:40:34.362Z",
    "log.origin": {
        "file.name": "module/wrapper.go",
        "file.line": 256
    },
    "message": "Error fetching data for metricset beat.state: error making http request:
 Get \"http://unix/state\": dial 
 unix /var/shared/filebeat-default-filebeat.sock: connect: permission denied",
    "service.name": "metricbeat",
    "ecs.version": "1.6.0"
}

I'm investigating these permission issues...

naemono avatar Sep 12 '22 13:09 naemono

@thbkrkr I meant to update yesterday. The permission errors were from attempting to use /usr/share/filebeat-sidecar as the path for the filebeat sidecar, which was unnecessary. I removed that from the config template, tested your manifest locally, and saw no issues. This should be ready for 👀 again.

naemono avatar Sep 13 '22 13:09 naemono

@thbkrkr There were additional issues found around the sidecar's permissions to read the unix socket. These were resolved, and tests were added for this feature. It's again ready for 👀

naemono avatar Sep 14 '22 15:09 naemono

I stepped into a trap when testing this yesterday by using one of our existing manifests with the -e option, and only with @thbkrkr's help was I able to identify that this was the reason my logs would not show up in the monitoring cluster.

I think we have a usability issue here with some of the existing recipes using the -e option. We can assume that some of our users might also have specified it when overriding the default Beats command. I think we should at least document this requirement and maybe even consider, as @thbkrkr suggested, automatically removing the -e option if log delivery is configured.
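For context on why -e bites here: the -e flag makes a Beat log to stderr and disables its file output, so a log-shipping sidecar that tails the Beat's log files has nothing to pick up. The kind of recipe override that triggers this looks roughly like the following (the args shown are illustrative of the existing recipes):

daemonSet:
  podTemplate:
    spec:
      containers:
      - name: metricbeat
        args:
        - -c
        - /etc/beat.yml
        - -e   # logs to stderr and disables file output, so the sidecar finds no log files to ship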

pebrc avatar Sep 21 '22 13:09 pebrc

I think we have a usability issue here with some of the existing recipes using the -e option. We can assume that some of our users might also have specified it when overriding the default Beats command. I think we should at least document this requirement and maybe even consider, as @thbkrkr suggested, automatically removing the -e option if log delivery is configured.

I didn't think of the scenario where the customer/recipe already had the -e option. Resolved in https://github.com/elastic/cloud-on-k8s/pull/5878/commits/52cd829cc6259cc1bdbd81158f845318bcd384f6. I'll also get the documentation updated to note this.

naemono avatar Sep 21 '22 16:09 naemono

@pebrc documentation added. Let me know how you feel about the wording.

naemono avatar Sep 21 '22 19:09 naemono