
Elastic Agent in external k8s cluster does not send data after installation

Open: derbl4ck opened this issue 2 years ago • 2 comments

Bug Report

What did you do? I run several Kubernetes clusters (AKS; Ubuntu) in which microservices and other applications are provisioned, as well as a dedicated AKS cluster hosting Kibana, Elasticsearch, Fleet Server and Elastic Agents. These are all provided by the eck-operator v2.2.0. I have made and attached a graphic to illustrate this.

Kibana, Elasticsearch, Fleet Server and APM Server are exposed on the network through nginx-ingress. The APM agents of each application in the different Kubernetes clusters send metrics and traces to "Kubernetes Cluster A", and these can be viewed nicely in Kibana. The Elastic Agents with the "kubernetes" package in "Kubernetes Cluster A" send Kubernetes metrics and logs to Elasticsearch, which can also be viewed in Kibana. So far this is a "normal" setup, and it works wonderfully.

Now, to also collect Kubernetes metrics and logs from clusters B and C, an Elastic Agent was provisioned there as well. Through the parameters FLEET_URL and FLEET_ENROLLMENT_TOKEN (see the resource definition below) the agent enrolls correctly and gets access to Elasticsearch. It then installs the integration packages stored in the specified policy, in this case "kubernetes:latest". After the installation completes, the status of the agents displayed in Kibana changes to "Healthy".

(attached diagram: elastic-fleet-example)

What did you expect to see? Metrics and logs for each Kubernetes node of clusters B and C should be stored in Elasticsearch and be viewable through Kibana. The logs of the individual datasets should be visible under "Management/Fleet/AgentXX/Logs". If an agent does not have the status "Healthy", a corresponding message should be displayed, ideally one that describes the problem as precisely as possible.

What did you see instead? Under which circumstances? Although the status of the Fleet-enrolled Elastic Agent is shown as "Healthy" and there is no error message in either the Fleet Server log or the Elastic Agent log, no logs or metrics are stored in Elasticsearch. Running elastic-agent status inside an agent container also reports "Healthy", and elastic-agent inspect shows a correct configuration and a valid api_key.

Environment

  • ECK version: 2.2.0

  • Kubernetes information: Azure Kubernetes Service (AKS) v1.22.4

  • kubectl version: v1.24.0

  • fleet-server resource definition:

apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: fleet-server
  namespace: elastic-apps
  labels:
    app.kubernetes.io/instance: fleet-server
    app.kubernetes.io/name: elastic-stack
    app.kubernetes.io/component: fleet-server
spec:
  version: 8.1.3
  kibanaRef:
    name: kibana
    namespace: elastic-apps
  elasticsearchRefs:
  - name: elasticsearch-data
    namespace: elastic-apps
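  # fleet mode plus fleetServerEnabled runs this Agent resource as the Fleet Server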
  mode: fleet
  fleetServerEnabled: true
  deployment:
    replicas: 1
    podTemplate:
      spec:
        nodeSelector:
          kubernetes.io/os: linux
        serviceAccountName: fleet-server
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0

  • Kubernetes Cluster B elastic-agent resource definition:
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata: 
  name: elastic-agent-k8s
  namespace: elastic-apps
  labels:
    app.kubernetes.io/name: elastic-stack
    app.kubernetes.io/component: elastic-agent
spec:
  version: 8.1.3
  mode: fleet
  daemonSet:
    podTemplate:
      spec:
        nodeSelector:
          kubernetes.io/os: linux
        tolerations:
          - key: node-role.kubernetes.io/master
            effect: NoSchedule
        serviceAccountName: elastic-agent
        automountServiceAccountToken: true
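        # hostNetwork exposes node-level metrics to the agent; ClusterFirstWithHostNet
        # keeps in-cluster DNS resolution working while on the host network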
        hostNetwork: true
        dnsPolicy: ClusterFirstWithHostNet
        containers:
          - name: agent
            env:
              - name: FLEET_ENROLL
                value: "1"
              # Set to true in case of insecure or unverified HTTP
              - name: FLEET_INSECURE
                value: "true"
              - name: FLEET_URL
                value: "https://fleet.mydomain.com"
              - name: FLEET_ENROLLMENT_TOKEN
                value: "WXNEQWdZRUJjRjFBcEFlYWpZRzI6WjE4MkhtNWZUV2kyUzRCUjlyWUtyQQ=="
              - name: NODE_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: spec.nodeName
              - name: POD_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.name
            securityContext:
              runAsUser: 0
            volumeMounts:
              - name: proc
                mountPath: /hostfs/proc
                readOnly: true
              - name: cgroup
                mountPath: /hostfs/sys/fs/cgroup
                readOnly: true
              - name: varlibdockercontainers
                mountPath: /var/lib/docker/containers
                readOnly: true
              - name: varlog
                mountPath: /var/log
                readOnly: true
        volumes:
          - name: proc
            hostPath:
              path: /proc
          - name: cgroup
            hostPath:
              path: /sys/fs/cgroup
          - name: varlibdockercontainers
            hostPath:
              path: /var/lib/docker/containers
          - name: varlog
            hostPath:
              path: /var/log

  • Fleet Server container logs:
{"log.level":"info","@timestamp":"","log.origin":{"file.name":"stateresolver/stateresolver.go","file.line":48},"message":"New State ID is A7oZEzcz","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"","log.origin":{"file.name":"stateresolver/stateresolver.go","file.line":49},"message":"Converging state requires execution of 0 step(s)","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"","log.origin":{"file.name":"stateresolver/stateresolver.go","file.line":66},"message":"Updating internal state","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"","log.origin":{"file.name":"stateresolver/stateresolver.go","file.line":48},"message":"New State ID is A7oZEzcz","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"","log.origin":{"file.name":"stateresolver/stateresolver.go","file.line":49},"message":"Converging state requires execution of 0 step(s)","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"","log.origin":{"file.name":"stateresolver/stateresolver.go","file.line":66},"message":"Updating internal state","ecs.version":"1.6.0"}

  • GET https://fleet.mydomain.com/api/status
{"name":"fleet-server","status":"HEALTHY"}

derbl4ck commented Jul 08 '22 12:07

Reading your bug report, this does not sound like an issue with the way the ECK operator manages Elastic Agent, but potentially more an issue with Elastic Agent itself. It might be worth taking this up with the Fleet/Agent team. It might also be worth collecting Elastic Agent diagnostics from the problematic cluster; the eck-diagnostics tool can optionally do that for you: https://github.com/elastic/eck-diagnostics. If you have a support contract with Elastic, the best way to make sure your issue is routed to the right people is to open a support case.

pebrc commented Jul 14 '22 09:07

I am facing a similar issue to the OP. ECK (latest version) is set up in our Azure environment. We have put all the services (Kibana, Fleet, Elasticsearch) behind an ingress. The Fleet Server URL is of the form https://xxx.mydomain.com:443/fleetserver-eck. I downloaded the agent manifest file from Kibana and used https://xxx.mydomain.com:443/fleetserver-eck as the Fleet Server URL in that file. The agent enrolls successfully, but after that the Fleet Server passes its internal URL, https://fleet-server-eck-agent-http.namespace.svc:8220/api/status, to the agent. The exact error is:

{"log.level":"error","@timestamp":"2022-08-26T09:30:13.406Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":211},"message":"failed to dispatch actions, error: fail to communicate with updated API client hosts: Get "https://fleet-server-eck-agent-http.namespace.svc:8220/api/status?": lookup fleet-server-eck-agent-http.namespace.svc on 10.96.0.10:53: no such host","ecs.version":"1.6.0"}.

This seems to be an issue on the Fleet Server side, in the way it handles the connection. After researching for the last few days with no answer it has become a real headache, and it is now a blocker for connecting to the Fleet Server. Any help is much appreciated.

SanjuTechie87 commented Aug 26 '22 09:08
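The error above suggests the agents are being handed the Fleet Server's in-cluster service URL rather than the external one. The hosts that Fleet advertises to agents come from Kibana's Fleet settings, which can be pinned declaratively rather than through the UI. A minimal sketch against an ECK-managed Kibana (the resource names and the external URL are assumptions taken from the comments above):

apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana                     # assumed name of the existing Kibana resource
  namespace: elastic-apps
spec:
  version: 8.1.3
  elasticsearchRef:
    name: elasticsearch-data
  config:
    # Advertise the externally reachable Fleet Server URL to enrolling agents
    xpack.fleet.agents.fleet_server.hosts:
      - "https://xxx.mydomain.com:443/fleetserver-eck"

With this in place, enrolled agents should receive the external URL in their policy instead of the in-cluster service address.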

Did you set up the output correctly?

This seems buggy (in my mind the operator should do this automatically), but you must manually set the Elasticsearch output URL:

(screenshot: Fleet settings showing the Elasticsearch output URL field, 2023-01-22 11:22:02)
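The same output can also be declared in the Kibana configuration instead of through the UI, via preconfigured Fleet outputs. A sketch of the relevant config block (the output id and the Elasticsearch URL are assumptions; use the externally reachable address of your cluster):

  config:
    xpack.fleet.outputs:
      - id: external-elasticsearch          # arbitrary identifier
        name: External Elasticsearch
        type: elasticsearch
        # Must be reachable from the external clusters, not the in-cluster service
        hosts: ["https://elasticsearch.mydomain.com:443"]
        is_default: true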

ebuildy commented Jan 22 '23 16:01