beats Elastic Agent won't start if the default rpc port is used

Version: Elastic Agent 7.14.2
Operating System: Ubuntu 20.04
Steps to Reproduce:

Elastic Agent binds to port 6789 by default. If another application is already using that port, the agent fails to start.

Enrollment/installation (via fleet server) doesn't return any error, only INFO messages:

root@server:~/elastic-agent-7.14.2-linux-x86_64# ./elastic-agent install -f --url=https://URL:443 --enrollment-token=TOKEN
2021-09-22T18:40:00.862+0200	INFO	cmd/enroll_cmd.go:396	Starting enrollment to URL: https://URL:443/
2021-09-22T18:40:02.024+0200	INFO	cmd/enroll_cmd.go:232	Elastic Agent might not be running; unable to trigger restart
2021-09-22T18:40:02.024+0200	INFO	cmd/enroll_cmd.go:234	Successfully triggered restart on running Elastic Agent.
Successfully enrolled the Elastic Agent.
Elastic Agent has been successfully installed.

The messages indicate that the agent has been restarted, enrolled and installed. But the agent is not running, and all we see in Kibana is that the agent is "Updating".

Only journalctl shows the real problem:

sep 22 18:52:05 server systemd[1]: Started Elastic Agent is a unified agent to observe, monitor and protect your system..
sep 22 18:52:05 server elastic-agent[2085818]: starting GRPC listener: listen tcp 127.0.0.1:6789: bind: address already in use
sep 22 18:52:05 server systemd[1]: elastic-agent.service: Main process exited, code=exited, status=1/FAILURE
sep 22 18:52:05 server systemd[1]: elastic-agent.service: Failed with result 'exit-code'

The workaround is to edit elastic-agent.yml under /opt/Elastic/Agent and set a different gRPC port:

agent.grpc:
  address: localhost
  port: 16789

Then restart the agent with /opt/Elastic/Agent/elastic-agent restart

Elastic Agent should at least detect this port collision during installation and display an error message warning the user about the problem.
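
For illustration, here is a minimal sketch of the kind of pre-flight check requested above, written in Python rather than in the agent's own code. It is not how Elastic Agent works today; the function names `port_in_use` and `preflight_message` are hypothetical, and the only values taken from this report are the address 127.0.0.1, the port 6789, and the agent.grpc setting.

```python
import errno
import socket

# Default listener port, as seen in the journalctl output in this report.
DEFAULT_GRPC_PORT = 6789


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if binding host:port fails because the address is taken."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return True
            raise  # any other failure is unrelated to a port collision
    return False


def preflight_message(port: int = DEFAULT_GRPC_PORT) -> str:
    """Build a hypothetical installer message for the given port."""
    if port_in_use(port):
        return (
            f"port {port} on 127.0.0.1 is already in use; "
            "set a different agent.grpc port in elastic-agent.yml"
        )
    return f"port {port} on 127.0.0.1 is free"


if __name__ == "__main__":
    print(preflight_message())
```

Running something like this before `elastic-agent install` would surface the collision up front instead of leaving it visible only in journalctl.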

Sep 24 '21 11:09 psanz-estc

cc: @EricDavisX

Sep 24 '21 11:09 psanz-estc

Pinging @elastic/agent (Team:Agent)

Sep 28 '21 00:09 elasticmachine

At a minimum, we can probably detect this and put better error logging in place to help triage.

Sep 28 '21 00:09 EricDavisX

Hi! We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1. Thank you for your contribution!

Sep 28 '22 01:09 botelastic[bot]

👍

This is causing issues for my org: a Kubernetes Deployment of a standalone agent hits this error on a cluster where a DaemonSet is already running on each node.

Oct 05 '22 18:10 barrettsmits

👍

This is causing issues for my org: a Kubernetes Deployment of a standalone agent hits this error on a cluster where a DaemonSet is already running on each node.

After going through the manifests line by line, we found that the Deployment and the DaemonSet cannot both use the hostNetwork: true flag:

Spec:
      hostNetwork: true

Changing it to false resolved the error for our Deployment on a Kubernetes cluster where the DaemonSet was already running.

Could this be documented to prevent others from running into it?

Per this page: https://www.elastic.co/guide/en/fleet/master/running-on-kubernetes-managed-by-fleet.html

_Deploying Elastic Agent to collect cluster-level metrics in large cluster_

_The size and the number of nodes in a Kubernetes cluster can be fairly large at times, and in such cases the Pod that will be collecting cluster level metrics might face performance issues due to resources limitations. In this case users might consider to avoid using the leader election strategy and instead run a dedicated, standalone Elastic Agent instance using a Deployment in addition to the DaemonSet._
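
To make the conflict concrete: pods with hostNetwork: true share the node's network namespace, so two agents scheduled on the same node both try to bind the same 127.0.0.1:6789. Below is a rough sketch in Python that flags which already-parsed manifests request the host network. The helper names and the example workload names (`agent-node`, `agent-cluster`) are made up for illustration, and it assumes the manifests have already been loaded into plain dictionaries.

```python
from typing import Any, Dict, Iterable, List

# Workload kinds whose pod spec lives under spec.template.spec.
POD_TEMPLATE_KINDS = {"DaemonSet", "Deployment", "StatefulSet"}


def uses_host_network(manifest: Dict[str, Any]) -> bool:
    """Return True if the manifest's pod spec sets hostNetwork: true."""
    kind = manifest.get("kind")
    spec = manifest.get("spec") or {}
    if kind in POD_TEMPLATE_KINDS:
        spec = (spec.get("template") or {}).get("spec") or {}
    elif kind != "Pod":
        return False
    return spec.get("hostNetwork") is True


def host_network_workloads(manifests: Iterable[Dict[str, Any]]) -> List[str]:
    """List kind/name for every workload that shares the node's network."""
    return [
        f"{m.get('kind')}/{(m.get('metadata') or {}).get('name')}"
        for m in manifests
        if uses_host_network(m)
    ]


if __name__ == "__main__":
    # Hypothetical manifests: a per-node DaemonSet plus a standalone Deployment.
    daemonset = {
        "kind": "DaemonSet",
        "metadata": {"name": "agent-node"},
        "spec": {"template": {"spec": {"hostNetwork": True}}},
    }
    deployment = {
        "kind": "Deployment",
        "metadata": {"name": "agent-cluster"},
        "spec": {"template": {"spec": {"hostNetwork": False}}},
    }
    print(host_network_workloads([daemonset, deployment]))
```

If more than one agent workload shows up in that list, they can land on the same node and collide on the port, which matches what we saw before setting the Deployment to hostNetwork: false.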

Oct 11 '22 13:10 barrettsmits

Thanks for this, it saved us a bunch of time. We wanted to run synthetics browser monitors, but the normal DaemonSet requires runAsUser: 0 and synthetics requires runAsUser: 1000, so we needed to combine hostNetwork: false and runAsUser: 1000 for that to work. Thanks!

Dec 16 '22 00:12 slacksach

~Still experiencing this with v8.5.3 and editing /opt/Elastic/Agent/elastic-agent.reference.yml doesn't work as the installation seems to have failed (Fleet says the agent status is "updating") and a restart just throws the following socket error:~

$ sudo /opt/Elastic/Agent/elastic-agent restart
Error: Failed trigger restart of daemon: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/elastic-agent.sock: connect: no such file or directory"
Usage:
  elastic-agent restart [flags]

Flags:
  -h, --help   help for restart

Global Flags:
  -c, --c string                     Configuration file, relative to path.config (default "elastic-agent.yml")
  -d, --d string                     Enable certain debug selectors
  -e, --e                            Log to stderr and disable syslog/file output
      --environment environmentVar   set environment being ran in (default default)
      --path.config string           Config path is the directory Agent looks for its config file (default "/opt/Elastic/Agent")
      --path.downloads string        Downloads path contains binaries Agent downloads
      --path.home string             Agent root path (default "/opt/Elastic/Agent")
      --path.install string          Install path contains binaries Agent extracts
      --path.logs string             Logs path contains Agent log output (default "/opt/Elastic/Agent")
  -v, --v                            Log at INFO level

Failed trigger restart of daemon: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/elastic-agent.sock: connect: no such file or directory"

EDIT: Just noticed I changed the wrong file. /opt/Elastic/Agent/elastic-agent.yml is the right one, and the suggested change from the original post works. However, sudo elastic-agent restart didn't work for me, whereas sudo systemctl restart elastic-agent did.

Jan 12 '23 16:01 dmgeurts

Hi! We just realized that we haven't looked into this issue in a while. We're sorry!

We're labeling this issue as Stale to make it hit our filters and make sure we get back to it as soon as possible. In the meantime, it'd be extremely helpful if you could take a look at it as well and confirm its relevance. A simple comment with a nice emoji will be enough :+1. Thank you for your contribution!

Jan 12 '24 17:01 botelastic[bot]
