cloud-on-k8s
Elastic-Agent in external k8s-cluster does not send data after installation
Bug Report
What did you do? I run several Kubernetes clusters (AKS, Ubuntu) in which microservices and other applications are provisioned, as well as a dedicated AKS cluster that hosts Kibana, Elasticsearch, Fleet Server and Elastic Agents. These are provided by the eck-operator v2.2.0. I have made and attached a graphic to illustrate this.
Kibana, Elasticsearch, Fleet Server and APM Server are exposed on the network through nginx-ingress. The APM agents of each application in the different Kubernetes clusters send metrics and traces to "Kubernetes Cluster A" and can be viewed nicely in Kibana. The Elastic Agents with the "kubernetes" package in "Kubernetes Cluster A" send Kubernetes metrics and logs to Elasticsearch, which can also be viewed in Kibana. So far this is a "normal" setup, and it works wonderfully.
Now, to also get Kubernetes metrics and logs from clusters B and C, an Elastic Agent was provisioned there. Through the parameters `FLEET_URL` and `FLEET_ENROLLMENT_TOKEN` the agent enrolls correctly and gets access to Elasticsearch. Subsequently, the agent installs the integration packages defined in the assigned policy - in this case "kubernetes:latest". After the installation is complete, the status of the agents displayed in Kibana changes to "Healthy".
What did you expect to see? There should be metrics and logs stored in Elasticsearch for each Kubernetes node of clusters B and C, viewable through Kibana. The logs of the individual datasets should be visible under "Management/Fleet/AgentXX/Logs". If an agent does not have the status "Healthy", a corresponding message should be displayed, ideally describing the problem as precisely as possible.
What did you see instead? Under which circumstances?
Although the status of the Fleet-enrolled Elastic Agent is shown as "Healthy" and there is no error message in either the Fleet Server log or the Elastic Agent log, no logs or metrics are stored in Elasticsearch. When I run `elastic-agent status` inside an agent container, "Healthy" is reported there as well. `elastic-agent inspect` shows a correct configuration and a valid `api_key`.
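For reference, this is roughly how those checks can be run against one of the agent pods in cluster B (the pod name is illustrative, and the label selector assumes the ECK-generated `agent.k8s.elastic.co/name` label):

```sh
# List the elastic-agent DaemonSet pods created by ECK (label selector is an assumption)
kubectl -n elastic-apps get pods -l agent.k8s.elastic.co/name=elastic-agent-k8s

# Ask the agent itself for its health (pod name is a placeholder)
kubectl -n elastic-apps exec elastic-agent-k8s-agent-xxxxx -c agent -- elastic-agent status

# Dump the rendered configuration, including the output and api_key in use
kubectl -n elastic-apps exec elastic-agent-k8s-agent-xxxxx -c agent -- elastic-agent inspect
```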
Environment
- ECK version: 2.2.0
- Kubernetes information: Azure Kubernetes Service (AKS) v1.22.4
- kubectl version: v1.24.0
- fleet-server resource definition:
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: fleet-server
  namespace: elastic-apps
  labels:
    app.kubernetes.io/instance: fleet-server
    app.kubernetes.io/name: elastic-stack
    app.kubernetes.io/component: fleet-server
spec:
  version: 8.1.3
  kibanaRef:
    name: kibana
    namespace: elastic-apps
  elasticsearchRefs:
    - name: elasticsearch-data
      namespace: elastic-apps
  mode: fleet
  fleetServerEnabled: true
  deployment:
    replicas: 1
    podTemplate:
      spec:
        nodeSelector:
          kubernetes.io/os: linux
        serviceAccountName: fleet-server
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0
- Kubernetes cluster B elastic-agent resource definition:
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: elastic-agent-k8s
  namespace: elastic-apps
  labels:
    app.kubernetes.io/name: elastic-stack
    app.kubernetes.io/component: elastic-agent
spec:
  version: 8.1.3
  mode: fleet
  daemonSet:
    podTemplate:
      spec:
        nodeSelector:
          kubernetes.io/os: linux
        tolerations:
          - key: node-role.kubernetes.io/master
            effect: NoSchedule
        serviceAccountName: elastic-agent
        automountServiceAccountToken: true
        hostNetwork: true
        dnsPolicy: ClusterFirstWithHostNet
        containers:
          - name: agent
            env:
              - name: FLEET_ENROLL
                value: "1"
              # Set to true in case of insecure or unverified HTTP
              - name: FLEET_INSECURE
                value: "true"
              - name: FLEET_URL
                value: "https://fleet.mydomain.com"
              - name: FLEET_ENROLLMENT_TOKEN
                value: "WXNEQWdZRUJjRjFBcEFlYWpZRzI6WjE4MkhtNWZUV2kyUzRCUjlyWUtyQQ=="
              - name: NODE_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: spec.nodeName
              - name: POD_NAME
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.name
            securityContext:
              runAsUser: 0
            volumeMounts:
              - name: proc
                mountPath: /hostfs/proc
                readOnly: true
              - name: cgroup
                mountPath: /hostfs/sys/fs/cgroup
                readOnly: true
              - name: varlibdockercontainers
                mountPath: /var/lib/docker/containers
                readOnly: true
              - name: varlog
                mountPath: /var/log
                readOnly: true
        volumes:
          - name: proc
            hostPath:
              path: /proc
          - name: cgroup
            hostPath:
              path: /sys/fs/cgroup
          - name: varlibdockercontainers
            hostPath:
              path: /var/lib/docker/containers
          - name: varlog
            hostPath:
              path: /var/log
- Fleet-Server Container Logs:
{"log.level":"info","@timestamp":"","log.origin":{"file.name":"stateresolver/stateresolver.go","file.line":48},"message":"New State ID is A7oZEzcz","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"","log.origin":{"file.name":"stateresolver/stateresolver.go","file.line":49},"message":"Converging state requires execution of 0 step(s)","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"","log.origin":{"file.name":"stateresolver/stateresolver.go","file.line":66},"message":"Updating internal state","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"","log.origin":{"file.name":"stateresolver/stateresolver.go","file.line":48},"message":"New State ID is A7oZEzcz","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"","log.origin":{"file.name":"stateresolver/stateresolver.go","file.line":49},"message":"Converging state requires execution of 0 step(s)","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"","log.origin":{"file.name":"stateresolver/stateresolver.go","file.line":66},"message":"Updating internal state","ecs.version":"1.6.0"}
- GET https://fleet.mydomain.com/api/status
{"name":"fleet-server","status":"HEALTHY"}
Reading your bug report, this does not sound like an issue with the way the ECK operator manages Elastic Agent but potentially more an issue with Elastic Agent itself? It might be worth taking this up with the Fleet/Agent team. It might also be worth getting Elastic Agent diagnostics from the problematic cluster; the eck-diagnostics tool can optionally do that for you: https://github.com/elastic/eck-diagnostics. If you have a support contract with Elastic, the best way to make sure your issue is routed to the right people is to open a support case to look further into it.
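A sketch of such a diagnostics run, assuming the operator lives in the default elastic-system namespace and the resources in elastic-apps as in the manifests above (verify the flag names with --help on your eck-diagnostics version):

```sh
# Collect ECK and Elastic Agent diagnostics from the affected cluster
eck-diagnostics \
  --operator-namespaces elastic-system \
  --resources-namespaces elastic-apps \
  --run-agent-diagnostics
```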
I am facing a similar issue as the OP. ECK (latest version) is set up in our Azure environment. We have put all the services like Kibana, Fleet and Elasticsearch behind an ingress. The Fleet Server URL is of the form https://xxx.mydomain.com:443/fleetserver-eck. I downloaded the agent manifest file from Kibana and used https://xxx.mydomain.com:443/fleetserver-eck as the Fleet Server URL in that file. The agent enrolls successfully, but after that the Fleet Server actually passes its internal URL https://fleet-server-eck-agent-http.namespace.svc:8220/api/status? to the agent. The exact error is:
{"log.level":"error","@timestamp":"2022-08-26T09:30:13.406Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":211},"message":"failed to dispatch actions, error: fail to communicate with updated API client hosts: Get "https://fleet-server-eck-agent-http.namespace.svc:8220/api/status?": lookup fleet-server-eck-agent-http.namespace.svc on 10.96.0.10:53: no such host","ecs.version":"1.6.0"}.
This seems to be an issue on the Fleet Server side in the way it handles the connection. It has become a headache after researching for the last few days with no answer, and it is now a blocker for connecting to the Fleet Server. Any help is much appreciated.
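One way to confirm which Fleet host and Elasticsearch output the enrolled agent actually received is to look at its rendered configuration (pod and namespace names below are placeholders):

```sh
# Print the fleet and output host entries from the agent's rendered config
kubectl -n <namespace> exec <agent-pod> -c agent -- elastic-agent inspect | grep -A 3 "hosts:"
```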
Did you set up the output correctly?
This seems buggy (I mean, in my mind this should be done automatically by the operator), but you must manually set the Elasticsearch output URL:
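A minimal sketch of what that manual configuration might look like in the Kibana resource, assuming the stack is only reachable from the other clusters via the external ingress URLs (the URLs below are placeholders, and unrelated Kibana settings are omitted):

```yaml
# Kibana resource: advertise externally reachable Fleet Server and Elasticsearch
# URLs to enrolled agents instead of the in-cluster service names.
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana
  namespace: elastic-apps
spec:
  version: 8.1.3
  # elasticsearchRef, count, etc. omitted for brevity
  config:
    xpack.fleet.agents.fleet_server.hosts:
      - "https://fleet.mydomain.com"
    xpack.fleet.agents.elasticsearch.hosts:
      - "https://elasticsearch.mydomain.com"
```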
