
fleet-server "failed to fetch elasticsearch version" - ECK install on OpenShift isn't working

Open ALL-SPACE-Anas opened this issue 1 year ago • 7 comments

Elasticsearch Version

Version: 8.15.2, Build: docker/98adf7bf6bb69b66ab95b761c9e5aadb0bb059a3/2024-09-19T10:06:03.564235954Z, JVM: 22.0.1

Installed Plugins

No response

Java Version

bundled

OS Version

OpenShift BareMetal

Problem Description

I have deployed ECK on OpenShift bare-metal servers for a POC. While I can reach the Kibana dashboard, I cannot get fleet-server to start and work. I'm using the default configuration from the documentation (https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-openshift-deploy-the-operator.html and https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-elastic-agent-fleet-quickstart.html) for the most part, with small modifications where needed.

These are my manifests:

apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana-sample
spec:
  version: 8.15.2
  count: 1
  elasticsearchRef:
    name: "elasticsearch-sample"
  podTemplate:
    spec:
      containers:
      - name: kibana
        resources:
          limits:
            memory: 1Gi
            cpu: 1
  config:
    server.publicBaseUrl: "https://#######"
    xpack.fleet.agents.elasticsearch.hosts: ["https://elasticsearch-sample-es-http.elastic.svc:9200"]
    xpack.fleet.agents.fleet_server.hosts: ["https://fleet-server-sample-agent-http.elastic.svc:8220"]
    xpack.fleet.packages:
      - name: system
        version: latest
      - name: elastic_agent
        version: latest
      - name: fleet_server
        version: latest
      - name: apm
        version: latest
    xpack.fleet.agentPolicies:
      - name: Fleet Server on ECK policy
        id: eck-fleet-server
        namespace: elastic
        monitoring_enabled:
          - logs
          - metrics
        unenroll_timeout: 900
        package_policies:
        - name: fleet_server-1
          id: fleet_server-1
          package:
            name: fleet_server
      - name: Elastic Agent on ECK policy
        id: eck-agent
        namespace: elastic
        monitoring_enabled:
          - logs
          - metrics
        unenroll_timeout: 900
        package_policies:
          - name: system-1
            id: system-1
            package:
              name: system
---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-sample
spec:
  version: 8.15.2
  nodeSets:
    - name: default
      count: 1
      config:
        node.store.allow_mmap: false
        index.store.type: niofs # https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-store.html
---
apiVersion: apm.k8s.elastic.co/v1
kind: ApmServer
metadata:
  name: apm-server-sample
spec:
  version: 8.15.2
  count: 1
  elasticsearchRef:
    name: "elasticsearch-sample"
  kibanaRef: 
    name: kibana-sample
  podTemplate:
    spec:
      serviceAccountName: apm-server

Agent state: oc get agents

NAME                   HEALTH   AVAILABLE   EXPECTED   VERSION   AGE
elastic-agent-sample   green    3           3          8.15.2    138m
fleet-server-sample    red                  1          8.15.2    138m

oc describe agent fleet-server-sample

Name:         fleet-server-sample
Namespace:    elastic
Labels:       <none>
Annotations:  ###
API Version:  agent.k8s.elastic.co/v1alpha1
Kind:         Agent
Metadata: ###
Spec:
  Deployment:
    Pod Template:
      Metadata:
        Creation Timestamp:  <nil>
      Spec:
        Automount Service Account Token:  true
        Containers:                       <nil>
        Security Context:
          Run As User:         0
        Service Account Name:  elastic-agent
        Volumes:
          Name:  agent-data
          Persistent Volume Claim:
            Claim Name:  fleet-server-sample
    Replicas:            1
    Strategy:
  Elasticsearch Refs:
    Name:                elasticsearch-sample
  Fleet Server Enabled:  true
  Fleet Server Ref:
  Http:
    Service:
      Metadata:
      Spec:
    Tls:
      Certificate:
  Kibana Ref:
    Name:     kibana-sample
  Mode:       fleet
  Policy ID:  eck-fleet-server
  Version:    8.15.2
Status:
  Elasticsearch Associations Status:
    elastic/elasticsearch-sample:  Established
  Expected Nodes:                  1
  Health:                          red
  Kibana Association Status:       Established
  Observed Generation:             2
  Version:                         8.15.2
Events:
  Type     Reason                   Age                   From                                 Message
  ----     ------                   ----                  ----                                 -------
  Warning  AssociationError         138m (x5 over 138m)   agent-controller                     Association backend for elasticsearch is not configured
  Warning  AssociationError         138m (x9 over 138m)   agent-controller                     Association backend for kibana is not configured
  Normal   AssociationStatusChange  138m                  agent-es-association-controller      Association status changed from [] to [elastic/elasticsearch-sample: Established]
  Normal   AssociationStatusChange  138m                  agent-kibana-association-controller  Association status changed from [] to [Established]
  Warning  Delayed                  138m (x11 over 138m)  agent-controller                     Delaying deployment of Elastic Agent in Fleet Mode as Kibana is not available yet

fleet-server pod error logs (the pod is in CrashLoopBackOff):

{"log.level":"error","@timestamp":"2024-10-14T16:35:35.550Z","message":"failed to fetch elasticsearch version","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"@timestamp":"2024-10-14T16:35:35.55Z","ecs.version":"1.6.0","service.name":"fleet-server","service.type":"fleet-server","error.message":"dial tcp [::1]:9200: connect: connection refused","ecs.version":"1.6.0"}
{"log.level":"warn","@timestamp":"2024-10-14T16:35:35.551Z","message":"Failed Elasticsearch output configuration test, using bootstrap values.","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0","service.name":"fleet-server","service.type":"fleet-server","error.message":"dial tcp [::1]:9200: connect: connection refused","output":{"hosts":["localhost:9200"],"protocol":"https","proxy_disable":false,"proxy_headers":{},"service_token":"#####","ssl":{"certificate_authorities":["/mnt/elastic-internal/elasticsearch-association/elastic/elasticsearch-sample/certs/ca.crt"],"verification_mode":"full"},"type":"elasticsearch"},"@timestamp":"2024-10-14T16:35:35.55Z","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:35.612Z","message":"panic: runtime error: invalid memory address or nil pointer dereference","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.012Z","message":"[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x55df2cba3217]","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.012Z","message":"goroutine 279 [running]:","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.012Z","message":"github.com/elastic/fleet-server/v7/internal/pkg/server.(*Agent).configFromUnits(0xc000002240, {0x55df2d489218, 0xc000486370})","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.012Z","message":"/opt/buildkite-agent/builds/bk-agent-prod-aws-1726684516326467547/elastic/fleet-server-package-mbp/internal/pkg/server/agent.go:441 +0x97","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.012Z","message":"github.com/elastic/fleet-server/v7/internal/pkg/server.(*Agent).start(0xc000002240, {0x55df2d489218, 0xc000486370})","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.012Z","message":"/opt/buildkite-agent/builds/bk-agent-prod-aws-1726684516326467547/elastic/fleet-server-package-mbp/internal/pkg/server/agent.go:344 +0x51","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.012Z","message":"github.com/elastic/fleet-server/v7/internal/pkg/server.(*Agent).reconfigure(0xc0002fd728?, {0x55df2d489218?, 0xc000486370?})","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.012Z","message":"/opt/buildkite-agent/builds/bk-agent-prod-aws-1726684516326467547/elastic/fleet-server-package-mbp/internal/pkg/server/agent.go:387 +0x8d","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.013Z","message":"github.com/elastic/fleet-server/v7/internal/pkg/server.(*Agent).Run.func5()","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.013Z","message":"/opt/buildkite-agent/builds/bk-agent-prod-aws-1726684516326467547/elastic/fleet-server-package-mbp/internal/pkg/server/agent.go:204 +0x5c5","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.148Z","message":"created by github.com/elastic/fleet-server/v7/internal/pkg/server.(*Agent).Run in goroutine 1","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.148Z","message":"/opt/buildkite-agent/builds/bk-agent-prod-aws-1726684516326467547/elastic/fleet-server-package-mbp/internal/pkg/server/agent.go:162 +0x416","component":{"binary":"fleet-server","dataset":"elastic_agent.fleet_server","id":"fleet-server-default","type":"fleet-server"},"log":{"source":"fleet-server-default"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.515Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":647},"message":"Component state changed fleet-server-default (STARTING->FAILED): Failed: pid '1214' exited with code '2'","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"FAILED","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.515Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":665},"message":"Unit state changed fleet-server-default-fleet-server (STARTING->FAILED): Failed: pid '1214' exited with code '2'","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"FAILED"},"unit":{"id":"fleet-server-default-fleet-server","type":"input","state":"FAILED","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:36.516Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/application/coordinator.(*Coordinator).watchRuntimeComponents","file.name":"coordinator/coordinator.go","file.line":665},"message":"Unit state changed fleet-server-default (STARTING->FAILED): Failed: pid '1214' exited with code '2'","log":{"source":"elastic-agent"},"component":{"id":"fleet-server-default","state":"FAILED"},"unit":{"id":"fleet-server-default","type":"output","state":"FAILED","old_state":"STARTING"},"ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2024-10-14T16:36:45.612Z","log.origin":{"function":"github.com/elastic/elastic-agent/internal/pkg/agent/cmd.logReturn","file.name":"cmd/run.go","file.line":162},"message":"2 errors occurred:\n\t* timeout while waiting for managers to shut down: no response from runtime manager, no response from vars manager\n\t* config manager: failed to initialize Fleet Server: context deadline exceeded\n\n","log":{"source":"elastic-agent"},"ecs.version":"1.6.0"}
Error: 2 errors occurred:
	* timeout while waiting for managers to shut down: no response from runtime manager, no response from vars manager
	* config manager: failed to initialize Fleet Server: context deadline exceeded

From the logs, it appears that the fleet-server pod is looking for the Elasticsearch cluster at localhost instead of sending requests to the Elasticsearch service. There are other errors as well, but I think this needs to be resolved first.
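To make that observation easier to check across a long log, here is a minimal Python sketch that scans JSON log lines and reports any entry whose Elasticsearch output points at a loopback address. The field names (`output.hosts`, `error.message`, `@timestamp`) are the ones visible in the log lines above; the function names `host_name` and `loopback_outputs` are hypothetical helpers, not part of fleet-server or ECK.

```python
import json

# Host names treated as "this pod itself" rather than a cluster service.
LOOPBACK_NAMES = {"localhost", "127.0.0.1", "::1"}


def host_name(host):
    """Strip an optional scheme and port from an entry such as 'https://es:9200'."""
    host = host.split("://", 1)[-1]
    if host.startswith("["):              # bracketed IPv6 literal, e.g. [::1]:9200
        return host[1:host.index("]")]
    if host.count(":") == 1:              # plain host:port
        return host.split(":", 1)[0]
    return host                           # bare host name or bare IPv6 address


def loopback_outputs(log_lines):
    """Yield (timestamp, hosts, error) for entries whose output targets loopback."""
    for line in log_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue                      # skip non-JSON lines such as "Error: ..."
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        output = entry.get("output")
        if not isinstance(output, dict):
            continue
        hosts = output.get("hosts", [])
        if any(host_name(h) in LOOPBACK_NAMES for h in hosts):
            yield entry.get("@timestamp"), hosts, entry.get("error.message")
```

Feeding it the lines of a saved log file (for example, the output of `oc logs` for the fleet-server pod) should surface the "using bootstrap values" entry shown above, where `hosts` is `["localhost:9200"]`. It only reports what the log says; it does not explain why the bootstrap values are being used.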

Errors in the Kibana pod:

[2024-10-14T16:17:47.714+00:00][ERROR][elasticsearch-service] Unable to retrieve version information from Elasticsearch nodes. Request timed out
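Both the Kibana timeout and the fleet-server "connection refused" error concern reaching Elasticsearch over the network, so one way to narrow things down is a plain TCP check against the Elasticsearch service from inside the namespace. The sketch below is only an illustration under assumptions: `can_connect` is a hypothetical helper, the host and port are the ones from the Kibana config above (`elasticsearch-sample-es-http.elastic.svc`, 9200), and it assumes a Python interpreter is available wherever it is run. It tests TCP reachability only, not TLS or authentication.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return (True, None) if a TCP connection succeeds, else (False, reason)."""
    try:
        # create_connection resolves the name and tries each returned address.
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        # Covers DNS failures, timeouts, and "connection refused".
        return False, str(exc)
```

For example, `can_connect("elasticsearch-sample-es-http.elastic.svc", 9200)` returning a failure with a timeout reason would point at a network or policy problem, while success would suggest the service itself is reachable and the problem lies in which host the client is configured to use.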

Steps to Reproduce

Deploy an ECK cluster using the manifests above, which are the defaults for the most part, with some changes.

Logs (if relevant)

No response

ALL-SPACE-Anas, Oct 14 '24 20:10