cloud-on-k8s
ECK Fleet Server Behind Ingress - elastic agents becoming unhealthy
Hi All,
Brief facts:
We have deployed ECK on Azure AKS, and the whole setup sits behind an Ingress, as seen in the diagram below. The requirement is to connect Elastic Agents residing outside of the ECK cluster to Fleet Servers running inside the cluster. Agents can come from the internal corporate network or connect through the Internet, so an Ingress has been set up to load balance between Fleet Servers. In the Ingress we have configured three backend services:
- Elasticsearch - https://xxxx.mydomain.com:443/elasticsearch-eck
- Kibana - https://xxxx.mydomain.com:443/kibana-eck
- Fleet Server - https://xxxx.mydomain.com:443/fleetserver-eck
We have no problem connecting to Kibana and Elasticsearch through the Ingress. (A minimal sketch of this kind of path-based routing is included below for reference.)
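A rough sketch of such an Ingress, assuming the NGINX ingress controller; the Elasticsearch and Fleet Server service names and ports are taken from the Kibana config further down, while the Kibana service name kibana-eck-kb-http and port 5601 are assumptions based on ECK naming conventions. This is illustrative only, and path-prefix routing to these backends may additionally require rewrite rules or base-path settings that are not shown here:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: eck-ingress
  namespace: observability
  annotations:
    # the ECK-managed services terminate TLS themselves, so the backend protocol is HTTPS
    nginx.ingress.kubernetes.io/backend-protocol: HTTPS
spec:
  rules:
    - host: xxxx.mydomain.com
      http:
        paths:
          - path: /elasticsearch-eck
            pathType: Prefix
            backend:
              service:
                name: elasticsearch-eck-es-http
                port:
                  number: 9200
          - path: /kibana-eck
            pathType: Prefix
            backend:
              service:
                name: kibana-eck-kb-http   # assumed ECK default name for the Kibana service
                port:
                  number: 5601
          - path: /fleetserver-eck
            pathType: Prefix
            backend:
              service:
                name: fleet-server-eck-agent-http
                port:
                  number: 8220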
Issue Currently Being Faced:
The issue we are facing is that when any Elastic Agent outside the cluster tries to connect to the Fleet Server through the Ingress, the agent enrolls successfully but then turns unhealthy.
What we found in the local agent's log is that after the agent is enrolled in the Fleet Server (the Ingress URL https://xxxx.mydomain.com:443/fleetserver-eck is used during enrollment), the Fleet Server returns its internal URL (https://fleet-server-eck-agent-http.namespace.svc:8220/api/status) to the Elastic Agent. This is the Fleet Server's Kubernetes service URL, which the external Elastic Agent has no means to resolve.
The exact error is:
{"log.level":"error","@timestamp":"2022-08-26T09:30:13.406Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":211},"message":"failed to dispatch actions, error: fail to communicate with updated API client hosts: Get "[https://fleet-server-eck-agent-http.namespace.svc:8220/api/status?](https://fleet-server-eck-agent-http.namespace.svc:8220/api/status?%5C)": lookup fleet-server-eck-agent-http.namespace.svc on 10.96.0.10:53: no such host","ecs.version":"1.6.0"}.
Different options tried:
- Added the Ingress URL to the Kibana setting xpack.fleet.agents.fleet_server.hosts along with the Fleet Server's service URL, i.e.:
  - https://xxxx.mydomain.com:443/fleetserver-eck
  - https://fleet-server-eck-agent-http.namespace.svc:8220
- Used --proxy-url with the Ingress URL https://xxxx.mydomain.com:443/fleetserver-eck when starting the Elastic Agent.
None of the above options helped (sketches of both attempts are shown below).
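For clarity, the first attempt would look roughly like this in the Kibana config shown further down (only the xpack.fleet.agents.fleet_server.hosts value changes; this is a sketch of what was tried, not a working fix):
config:
  xpack.fleet.agents.fleet_server.hosts:
    - https://xxxx.mydomain.com:443/fleetserver-eck
    - https://fleet-server-eck-agent-http.observability.svc:8220
And the second attempt, roughly as it would look on the command line (the enrollment token is a placeholder, and exact flags may differ by Elastic Agent version):
elastic-agent enroll \
  --url=https://xxxx.mydomain.com:443/fleetserver-eck \
  --proxy-url=https://xxxx.mydomain.com:443/fleetserver-eck \
  --enrollment-token=<token>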
Note: when we curl https://xxxx.mydomain.com:443/fleetserver-eck/api/status, it returns a healthy status.
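For reference, that check looks roughly like this (the -k flag, which skips certificate verification, is an assumption and only needed if the client does not trust the Ingress certificate; the expected output is described rather than reproduced):
curl -k https://xxxx.mydomain.com:443/fleetserver-eck/api/status
# expected: a JSON body reporting the Fleet Server as healthy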
Elastic Agent Configuration
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: elastic-agent-ums
  namespace: observability
spec:
  version: 8.4.0
  kibanaRef:
    name: kibana-eck
  fleetServerRef:
    name: fleet-server-eck
  mode: fleet
  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: elastic-agent-serviceaccount
        hostNetwork: true
        dnsPolicy: ClusterFirstWithHostNet
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0
Fleet Server Configuration
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: fleet-server-eck
  namespace: observability
spec:
  version: 8.4.0
  kibanaRef:
    name: kibana-eck
  elasticsearchRefs:
    - name: elasticsearch-eck
  mode: fleet
  fleetServerEnabled: true
  deployment:
    replicas: 2
    podTemplate:
      spec:
        serviceAccountName: fleet-server-serviceaccount
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0
Kibana Configuration
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana-eck
  namespace: observability
spec:
  version: 8.4.0
  count: 2
  elasticsearchRef:
    name: elasticsearch-eck
  config:
    xpack.fleet.agents.elasticsearch.hosts:
      ["https://elasticsearch-eck-es-http.observability.svc:9200"]
    xpack.fleet.agents.fleet_server.hosts:
      ["https://fleet-server-eck-agent-http.observability.svc:8220"]
    xpack.fleet.packages:
      - name: system
        version: latest
      - name: elastic_agent
        version: latest
      - name: fleet_server
        version: latest
      - name: kubernetes
        # pinning this version as the next one introduced a kube-proxy host setting default that breaks this recipe,
        # see https://github.com/elastic/integrations/pull/1565 for more details
        version: 0.14.0
      - name: apm
        version: latest
    xpack.fleet.agentPolicies:
      - name: Fleet Server on ECK policy
        id: eck-fleet-server
        namespace: observability
        monitoring_enabled:
          - logs
          - metrics
        is_default_fleet_server: true
        package_policies:
          - name: fleet_server-1
            id: fleet_server-1
            package:
              name: fleet_server
      - name: Elastic Agent on ECK policy
        id: eck-agent
        namespace: observability
        monitoring_enabled:
          - logs
          - metrics
        unenroll_timeout: 900
        is_default: true
        package_policies:
          - name: system-1
            id: system-1
            package:
              name: system
          - name: kubernetes-1
            id: kubernetes-1
            package:
              name: kubernetes
          - name: apm-1
            id: apm-1
            package:
              name: apm
            inputs:
              - type: apm
                enabled: true
                vars:
                  - name: host
                    value: 0.0.0.0:8200
We have been stuck on this issue for many days now, and any help is much appreciated. Please let us know if there is any additional configuration we need that is currently missing, and also whether what we are trying to achieve is even supported at the moment. Thanks.
Hi, I have a similar setup to yours and am having the same issue.
I'm able to connect with an Elastic Agent from a SLES 15 SP3 host using self-signed certificates (from our Windows environment).
The Elastic Agent switches status from "healthy" to "unhealthy" approx. 5 minutes after enrollment.
I get the following error message when I run elastic-agent status on my SLES 15 machine:
Status: DEGRADED
Message: component gateway-a6d13732: checkin failed: could not decode the response, raw response: <html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx</center>
</body>
</html>
This may indicate that something is wrong with my Kubernetes ingress controller, so I will continue troubleshooting.
This is the error message from Kibana: first it fails to connect to Elasticsearch, but after the second try the connection seems to get established. I don't know whether this is related to the Elastic Agent becoming degraded.
[elastic_agent.filebeat][error] failed to perform any bulk index operations: the bulk payload is too large for the server. Consider to adjust `http.max_content_length` parameter in Elasticsearch or `bulk_max_size` in the beat. The batch has been dropped
12:00:58.064  elastic_agent.filebeat  [elastic_agent.filebeat][error] failed to publish events: the bulk payload is too large for the server. Consider to adjust `http.max_content_length` parameter in Elasticsearch or `bulk_max_size` in the beat. The batch has been dropped
12:00:58.065  elastic_agent.filebeat  [elastic_agent.filebeat][info] Connecting to backoff(elasticsearch(https://https://es02t.x.x:443))
12:00:58.094  elastic_agent.filebeat  [elastic_agent.filebeat][info] Attempting to connect to Elasticsearch version 8.4.0
12:00:58.160  elastic_agent.filebeat  [elastic_agent.filebeat][info] Connection to backoff(elasticsearch(https://es02t.x.x:443)) established
12:01:00.729  system.syslog  snapperd.service: Succeeded.
Any help is much appreciated from my part also, thanks.
Hi @SanjuTechie87, I solved the issue by changing the ingress-controller configuration with the following values.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: es-ingress
  namespace: default
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: HTTPS
    nginx.ingress.kubernetes.io/secure-backends: "true"
    ingress.kubernetes.io/ssl-passthrough: "true"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "360"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "360"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "360"
It seems like the agent keeps a filebeat connection open for 5 minutes, then closes it and starts another one, as shown below.
09:12:41.736  elastic_agent.filebeat  [elastic_agent.filebeat][info] File is inactive. Closing because close_inactive of 5m0s reached.
09:12:58.743  elastic_agent.filebeat  [elastic_agent.filebeat][info] Harvester started for paths: [/var/log/messages* /var/log/syslog*]
09:13:08.147  elastic_agent.filebeat  [elastic_agent.filebeat][info] Non-zero metrics in the last 30s
[... the same "Non-zero metrics in the last 30s" message repeats every 30 seconds until 09:23:38.147 ...]
09:24:05.791  elastic_agent.filebeat  [elastic_agent.filebeat][info] File is inactive. Closing because close_inactive of 5m0s reached.
09:24:08.147  elastic_agent.filebeat  [elastic_agent.filebeat][info] Non-zero metrics in the last 30s
Related link to the issue: https://stefangeiger.ch/2022/04/03/aks-ingress-timeout.html. Hope it works out for you.
I have a similar problem: https://github.com/elastic/cloud-on-k8s/issues/5867
@SanjuTechie87 Did this suggested solution from @gittihub123 work for you?
Hi @derbl4ck, I can confirm that it worked for me. Please post your logs and I will help you solve it :)
Hi All,
I have made it work. The problem now is that if I want the Elastic Agents running inside the ECK cluster to communicate with Fleet and Elasticsearch using the internal Kubernetes service URLs (e.g. xxxx.namespace.svc:8220), there is a difficulty. The traffic always goes via the Ingress URLs of Fleet and Elasticsearch, because in the Kibana output section we have to list the Ingress URL first in the array. If we don't list it first, the agents outside the ECK cluster cannot connect, because by default the agents seem to try the first URL provided in the output section.
We don't want the agents inside the ECK cluster to make an Internet connection through the Ingress URL; they should go through the internal service URL instead. But I didn't find a solution for this. I tried to override the Fleet and Elasticsearch URLs via environment variables on the Elastic Agent pods inside the ECK cluster (roughly as sketched below), but they still seem to connect to the first URL (the Ingress URL) provided in the Kibana output.
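For reference, the attempted override would look roughly like this as a fragment of the Agent spec shown earlier; a minimal sketch only, assuming the FLEET_URL environment variable understood by the Elastic Agent container image and the container name agent used by ECK. Whether ECK honours such an override in fleet mode is exactly what is in question here:
  daemonSet:
    podTemplate:
      spec:
        containers:
          - name: agent   # assumed ECK container name
            env:
              # hypothetical override: point in-cluster agents at the internal Fleet Server service
              - name: FLEET_URL
                value: https://fleet-server-eck-agent-http.observability.svc:8220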
@gittihub123 did you face the above issue? If yes, what solution did you implement?
Hi,
We don't want the agents inside the ECK cluster to make an Internet connection through the Ingress URL; they should go through the internal service URL instead. But I didn't find a solution for this.
I understand your problem. Unfortunately, my feeling is that your issue is more related to the Fleet project than to the operator itself, which explains why this issue did not get a lot of traction. I have still pinged the Elastic observability team to notify them about your questions.
I'm closing this issue because my feeling is that your questions are more related to Fleet itself than to the operator. I would suggest opening a new topic on discuss to reach out to the observability team.
Hi. Did you find a better solution?