
ECK Fleet Server Behind Ingress - elastic agents becoming unhealthy

Open SanjuTechie87 opened this issue 2 years ago • 5 comments

Hi All,

Brief Facts:

We have deployed ECK in Azure AKS. The whole deployment sits behind an Ingress, as shown in the diagram below. The requirement is to connect Elastic Agents residing outside the ECK cluster to the Fleet Servers running inside the cluster. Agents may connect from the internal corporate network or over the Internet, so an Ingress has been set up to load balance across the Fleet Servers. In the Ingress we have configured three backend services:

  • Elasticsearch - https://xxxx.mydomain.com:443/elasticsearch-eck
  • Kibana - https://xxxx.mydomain.com:443/kibana-eck
  • Fleet Server - https://xxxx.mydomain.com:443/fleetserver-eck

[architecture diagram: Elastic Agents outside the cluster connect through the Ingress to the Elasticsearch, Kibana, and Fleet Server services inside AKS]
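
For context, a minimal sketch of what such a path-based Ingress could look like; the service names follow ECK's naming convention, and the paths, ports, and rewrite annotation are assumptions for illustration rather than the exact manifest:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: eck-ingress
  namespace: observability
  annotations:
    # ECK backends serve HTTPS with self-signed certificates by default
    nginx.ingress.kubernetes.io/backend-protocol: HTTPS
    # strip the path prefix before forwarding (assumed; depends on the controller setup)
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  rules:
    - host: xxxx.mydomain.com
      http:
        paths:
          - path: /elasticsearch-eck(/|$)(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: elasticsearch-eck-es-http
                port:
                  number: 9200
          - path: /kibana-eck(/|$)(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: kibana-eck-kb-http
                port:
                  number: 5601
          - path: /fleetserver-eck(/|$)(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: fleet-server-eck-agent-http
                port:
                  number: 8220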

We have no problem connecting to Kibana and Elasticsearch through the Ingress.

Issue Currently Being Faced:

The issue we are facing is that when any Elastic Agent outside the cluster tries to connect to the Fleet Server through the Ingress, the agent enrolls successfully but then turns unhealthy.

What we found in the local agent's log is that after the agent is enrolled in the Fleet Server (the Ingress URL https://xxxx.mydomain.com:443/fleetserver-eck is used during enrollment), the Fleet Server returns its internal URL, https://fleet-server-eck-agent-http.namespace.svc:8220/api/status, in its response to the Elastic Agent. That is the Fleet Server's Kubernetes service URL, which the external Elastic Agent has no way to resolve.

The exact error is:

{"log.level":"error","@timestamp":"2022-08-26T09:30:13.406Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":211},"message":"failed to dispatch actions, error: fail to communicate with updated API client hosts: Get "[https://fleet-server-eck-agent-http.namespace.svc:8220/api/status?](https://fleet-server-eck-agent-http.namespace.svc:8220/api/status?%5C)": lookup fleet-server-eck-agent-http.namespace.svc on 10.96.0.10:53: no such host","ecs.version":"1.6.0"}.

Different Options Tried

  • Added the Ingress URL to the Kibana setting xpack.fleet.agents.fleet_server.hosts alongside the Fleet Server's service URL, i.e.:

         - https://xxxx.mydomain.com:443/fleetserver-eck
         - https://fleet-server-eck-agent-http.namespace.svc:8220

  • Used --proxy-url with the Ingress URL https://xxxx.mydomain.com:443/fleetserver-eck when starting the Elastic Agent

None of the above options helped.

Note: When we curl https://xxxx.mydomain.com:443/fleetserver-eck/api/status, it reports a healthy status.

Elastic Agent Configuration

apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: elastic-agent-ums
  namespace: observability
spec:
  version: 8.4.0
  kibanaRef:
    name: kibana-eck
  fleetServerRef:
    name: fleet-server-eck
  mode: fleet
  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: elastic-agent-serviceaccount
        hostNetwork: true
        dnsPolicy: ClusterFirstWithHostNet
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0

Fleet Server Configuration

apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: fleet-server-eck
  namespace: observability
spec:
  version: 8.4.0
  kibanaRef:
    name: kibana-eck
  elasticsearchRefs:
    - name: elasticsearch-eck
  mode: fleet
  fleetServerEnabled: true
  deployment:
    replicas: 2
    podTemplate:
      spec:
        serviceAccountName: fleet-server-serviceaccount
        automountServiceAccountToken: true
        securityContext:
          runAsUser: 0

Kibana Configuration

apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana-eck
  namespace: observability
spec:
  version: 8.4.0
  count: 2
  elasticsearchRef:
    name: elasticsearch-eck
  config:
    xpack.fleet.agents.elasticsearch.hosts:
      ["https://elasticsearch-eck-es-http.observability.svc:9200"]
    xpack.fleet.agents.fleet_server.hosts:
      ["https://fleet-server-eck-agent-http.observability.svc:8220"]
    xpack.fleet.packages:
      - name: system
        version: latest
      - name: elastic_agent
        version: latest
      - name: fleet_server
        version: latest
      - name: kubernetes
        # pinning this version as the next one introduced a kube-proxy host setting default that breaks this recipe,
        # see https://github.com/elastic/integrations/pull/1565 for more details
        version: 0.14.0
      - name: apm
        version: latest

    xpack.fleet.agentPolicies:
      - name: Fleet Server on ECK policy
        id: eck-fleet-server
        namespace: observability
        monitoring_enabled:
          - logs
          - metrics
        is_default_fleet_server: true
        package_policies:
          - name: fleet_server-1
            id: fleet_server-1
            package:
              name: fleet_server
      - name: Elastic Agent on ECK policy
        id: eck-agent
        namespace: observability
        monitoring_enabled:
          - logs
          - metrics
        unenroll_timeout: 900
        is_default: true
        package_policies:
          - name: system-1
            id: system-1
            package:
              name: system
          - name: kubernetes-1
            id: kubernetes-1
            package:
              name: kubernetes
          - name: apm-1
            id: apm-1
            package:
              name: apm
            inputs:
              - type: apm
                enabled: true
                vars:
                  - name: host
                    value: 0.0.0.0:8200

We have been stuck with this issue for many days and any help is much appreciated. Please let us know if there is any additional configuration we are currently missing, and whether what we are trying to achieve is even supported at the moment. Thanks.

SanjuTechie87 avatar Aug 30 '22 09:08 SanjuTechie87

Hi, I have a similar setup to yours and I am having the same issue.

I am able to connect with an Elastic Agent from a SLES 15 SP3 host using self-signed certificates (from our Windows environment).

The Elastic Agent switches from "healthy" to "unhealthy" roughly 5 minutes after enrollment.

I get the following error message when I run elastic-agent status on my SLES 15 machine.

Status: DEGRADED
Message: component gateway-a6d13732: checkin failed: could not decode the response, raw response: <html>
<head><title>504 Gateway Time-out</title></head>
<body>
<center><h1>504 Gateway Time-out</h1></center>
<hr><center>nginx</center>
</body>
</html>

This may indicate that something is wrong with my Kubernetes ingress controller, so I will continue troubleshooting there.

This is the error message from Kibana: first it fails to connect to Elasticsearch, but on the second try the connection appears to be established. I don't know whether this is related to the Elastic Agent becoming degraded.

[elastic_agent.filebeat][error] failed to perform any bulk index operations: the bulk payload is too large for the server. Consider to adjust `http.max_content_length` parameter in Elasticsearch or `bulk_max_size` in the beat. The batch has been dropped
12:00:58.064
elastic_agent.filebeat
[elastic_agent.filebeat][error] failed to publish events: the bulk payload is too large for the server. Consider to adjust `http.max_content_length` parameter in Elasticsearch or `bulk_max_size` in the beat. The batch has been dropped
12:00:58.065
elastic_agent.filebeat
[elastic_agent.filebeat][info] Connecting to backoff(elasticsearch(https://https://es02t.x.x:443))
12:00:58.094
elastic_agent.filebeat
[elastic_agent.filebeat][info] Attempting to connect to Elasticsearch version 8.4.0
12:00:58.160
elastic_agent.filebeat
[elastic_agent.filebeat][info] Connection to backoff(elasticsearch(https://es02t.x.x:443)) established
12:01:00.729
system.syslog
snapperd.service: Succeeded.
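
For reference, the "bulk payload is too large" log line above names two knobs. A minimal sketch of raising the Elasticsearch side of that limit on the ECK cluster used earlier in this thread; the node set layout and the 200mb value are assumptions for illustration, not a recommendation:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-eck
  namespace: observability
spec:
  version: 8.4.0
  nodeSets:
    - name: default
      count: 3
      config:
        # default is 100mb; bulk requests larger than this are rejected
        http.max_content_length: 200mb

When the traffic also passes through ingress-nginx, the proxy's own request-body limit (the nginx.ingress.kubernetes.io/proxy-body-size annotation) can produce the same "payload too large" symptom, so that is worth checking as well.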

Any help is much appreciated from my part also, thanks.

gittihub123 avatar Sep 01 '22 11:09 gittihub123

Hi @SanjuTechie87, I solved the issue by changing the ingress-controller configuration with the following values.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: es-ingress
  namespace: default
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: HTTPS
    nginx.ingress.kubernetes.io/secure-backends: "true"
    ingress.kubernetes.io/ssl-passthrough: "true"
    # Raised proxy timeouts: Fleet agents keep long-polling check-in requests
    # open, and nginx's default 60s proxy timeouts cut them off with a
    # 504 Gateway Time-out.
    nginx.ingress.kubernetes.io/proxy-read-timeout: "360"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "360"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "360"

It seems like the agent keeps a connection open for filebeat for 5 minutes, closes it, and then opens another one, as shown below.

09:12:41.736
elastic_agent.filebeat
[elastic_agent.filebeat][info] File is inactive. Closing because close_inactive of 5m0s reached.
09:12:58.743
elastic_agent.filebeat
[elastic_agent.filebeat][info] Harvester started for paths: [/var/log/messages* /var/log/syslog*]
09:13:08.147
elastic_agent.filebeat
[elastic_agent.filebeat][info] Non-zero metrics in the last 30s
[... the same "Non-zero metrics in the last 30s" entry repeats every 30 seconds until 09:23:38 ...]
09:24:05.791
elastic_agent.filebeat
[elastic_agent.filebeat][info] File is inactive. Closing because close_inactive of 5m0s reached.
09:24:08.147
elastic_agent.filebeat
[elastic_agent.filebeat][info] Non-zero metrics in the last 30s

Related link for this issue: https://stefangeiger.ch/2022/04/03/aks-ingress-timeout.html. Hope it works out for you.

gittihub123 avatar Sep 02 '22 08:09 gittihub123

I have a similar problem: https://github.com/elastic/cloud-on-k8s/issues/5867

@SanjuTechie87 Did this suggested solution from @gittihub123 work for you?

derbl4ck avatar Sep 12 '22 14:09 derbl4ck

Hi @derbl4ck, I can confirm that it worked for me. Please post your logs and I will help you solve it :)

gittihub123 avatar Sep 12 '22 15:09 gittihub123

Hi All,

I have made it work. The remaining problem is this: if I want the Elastic Agents inside the ECK cluster to communicate with Fleet Server and Elasticsearch using the internal Kubernetes service URLs (e.g. xxxx.namespace.svc:8220), there is a difficulty. Their traffic always goes via the Ingress URLs of Fleet Server and Elasticsearch, because in the Kibana output settings we have to list the Fleet and Elasticsearch Ingress URLs first in the hosts array. If they are not first, the agents outside the ECK cluster cannot connect, because by default the agents seem to try the first URL provided in the output settings.

We don't want the agents inside the ECK cluster to make an external connection through the Ingress URL; they should go through the internal service URL instead. But I didn't find a solution for this. I tried overriding the Fleet and Elasticsearch URLs in the environment variables of the Elastic Agent pods inside the ECK cluster, but they still seem to connect to the first URL (the Ingress URL) provided in the Kibana output settings.
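
To make the ordering dilemma concrete, here is a minimal sketch of the relevant Kibana Fleet settings with both URLs; the comments only restate the behaviour described above, and the ordering shown mirrors what external agents need rather than being a recommendation:

xpack.fleet.agents.fleet_server.hosts:
  # Ingress URL listed first so that external agents, which can only reach
  # the Ingress, are able to connect ...
  - "https://xxxx.mydomain.com:443/fleetserver-eck"
  # ... but agents inside the cluster then also pick this first URL instead
  # of the internal service URL below
  - "https://fleet-server-eck-agent-http.observability.svc:8220"
xpack.fleet.agents.elasticsearch.hosts:
  - "https://xxxx.mydomain.com:443/elasticsearch-eck"
  - "https://elasticsearch-eck-es-http.observability.svc:9200"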

@gittihub123, did you face the above issue? If yes, what solution did you implement?

SanjuTechie87 avatar Sep 19 '22 03:09 SanjuTechie87

Hi,

> We don't want the agents inside the ECK cluster to make an external connection through the Ingress URL; they should go through the internal service URL instead. But I didn't find a solution for this.

I understand your problem; unfortunately, my feeling is that your issue is more related to the Fleet project than to the operator itself, which explains why this issue has not gotten a lot of traction. I have nevertheless pinged the Elastic observability team to notify them about your questions.

barkbay avatar Oct 27 '22 08:10 barkbay

I'm closing this issue because my feeling is that your questions relate more to Fleet itself than to the operator. I would suggest opening a new topic on Discuss to reach the observability team.

barkbay avatar Nov 21 '22 10:11 barkbay

> I have made it work. The remaining problem is that the Ingress URL has to be listed first in the Kibana output settings for external agents to connect, so the agents inside the ECK cluster also go out via the Ingress instead of the internal service URL. [...] @gittihub123, did you face the above issue? If yes, what solution did you implement?

Hi. Did you find a better solution?

mackuz avatar Mar 31 '23 05:03 mackuz