
ES Universe package on DC/OS Packet does not run

Open · olafmol opened this issue on May 16, 2016 · 19 comments

On DC/OS 1.7.x on Packet it keeps cycling through deploying, waiting, and failing. It seems to be unable to bind to the expected ports.

olafmol avatar May 16 '16 11:05 olafmol

Please provide the steps and configuration needed to reproduce this. We don't use Packet and we don't test on DC/OS, just plain old Mesos.

philwinder avatar May 16 '16 12:05 philwinder

Using this Terraform script: https://dcos.io/docs/1.7/administration/installing/cloud/packet/. After a successful install, go to "Universe" in the DC/OS dashboard and install the ES package. The same issue appears when following these Marathon installation instructions: http://mesos-elasticsearch.readthedocs.io/en/latest/#getting-started
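For reference, the CLI equivalent of the dashboard install (assuming the default Universe repo is configured) is:

$ dcos package install elasticsearch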

(BTW, it seems to work correctly when installing DC/OS on Google Cloud, so it might be a Packet-specific thing.)

olafmol avatar May 16 '16 12:05 olafmol

OK, thanks. I can't vouch for the DC/OS installer, as that hasn't been updated for a long time. But the Marathon command should work.

When you say "expected ports", how are you specifying them? By default, ES lets Mesos pick random ports from its resource pool. You can override this with the elasticsearchPorts option.
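For example (a sketch only; I'm assuming the usual comma-separated client,transport pair, so check the docs for the exact format), append it to the scheduler arguments, whether in the Marathon app definition or when running the scheduler container by hand:

$ docker run mesos/elasticsearch-scheduler [your existing args] --elasticsearchPorts 9200,9300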

philwinder avatar May 16 '16 12:05 philwinder

I don't specify a specific port.

olafmol avatar May 16 '16 12:05 olafmol

The issue seems to be that Elasticsearch's Java can't determine the local address:

java.net.UnknownHostException: zac-dcos-agent-03: zac-dcos-agent-03: unknown error
	at java.net.InetAddress.getLocalHost(InetAddress.java:1505)
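If it helps to narrow this down, a quick check on that agent (assuming shell access) is whether its own hostname resolves at all; empty output here would explain exactly this exception:

$ getent hosts zac-dcos-agent-03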

zsmithnyc avatar May 16 '16 16:05 zsmithnyc

@philwinder how does this container attempt to get its address? Is it using a metadata service?

zsmithnyc avatar May 16 '16 16:05 zsmithnyc

In my case the problem is the statically configured --default.network.publish_host=_non_loopback:ipv4_. I have tested this with DC/OS on Docker, and the executor always picks up the IPv4 address of the spartan interface. One fix is --default.network.publish_host=$(hostname -i). Maybe a parameter could be added for this setting, e.g. --executorNetworkPublishHost=_non_loopback:ipv4_
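A quick way to see the mismatch on an agent (a rough sketch; I'm assuming the interface really is named spartan on your box):

$ hostname -i
$ ip -4 addr show spartan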

jstabenow avatar May 24 '16 19:05 jstabenow

@zsmith928 I also had trouble with this package on DCOS-Docker and have tried to find a solution. It would be nice if you could verify whether it also runs on your system.

Just do:

dcos package repo add universe-jstabenow https://github.com/jstabenow/dcos-packages/archive/version-2.x.zip
dcos package install elasticsearch

Here is my workaround for the wrong "publish_host" on the executor: https://github.com/jstabenow/docker-images/tree/master/dcos-elasticsearch

I only replace the framework's argument with --default.network.publish_host=$LIBPROCESS_IP
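In effect, the executor then starts roughly like this (a sketch of the result, not the literal launch command from the image):

$ docker run --net=host elasticsearch:latest --default.network.publish_host=$LIBPROCESS_IP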

jstabenow avatar May 26 '16 08:05 jstabenow

update: https://github.com/mesos/elasticsearch/pull/569

jstabenow avatar May 27 '16 13:05 jstabenow

Unfortunately, taking @jstabenow's helpful repo for a spin doesn't seem to help us. We're seeing the same thing: Java complains that it doesn't know what the AWS-supplied hostname ip-10-1-23-254 is, and then fails to bind to localhost.

jbirch avatar May 28 '16 03:05 jbirch

Hey @jbirch,

That sounds like a similar problem.

There are many articles on Google about network problems with Elasticsearch and Java. That's why I added publish_host as a parameter.

In my case Elasticsearch elected the wrong interface for publish_host. In your case it's a problem resolving the elected address. So let's play with this parameter.

Can you post the executor log, and the results of running the following commands on your machine "ip-10-1-23-254"?

$ docker run -it --net=host elasticsearch:latest --default.network.publish_host="10.1.23.254"
$ docker run -it --net=host elasticsearch:latest --default.network.publish_host="ip-10-1-23-254"
$ docker run -it --net=host elasticsearch:latest --default.network.publish_host=$(hostname -i)

This is what the correct log should look like:

[2016-05-29 12:19:40,859][INFO ][transport                ] [Storm] publish_address {10.1.23.254:9300}, bound_addresses {[::]:9300}
[2016-05-29 12:19:44,093][INFO ][cluster.service          ] [Storm] new_master {Storm}{N93bF9aPT1SEaqsHGsF6Eg}{10.1.23.254}{10.1.23.254:9300}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2016-05-29 12:19:44,210][INFO ][http                     ] [Storm] publish_address {10.1.23.254:9200}, bound_addresses {[::]:9200}

And can you post the environment variables available in a running Docker container on your DC/OS cluster?

Hope we can find the problem and the right setting for you.

jstabenow avatar May 29 '16 12:05 jstabenow

Hey @jstabenow, thanks for taking the time to reply on the weekend to a stranger. I appreciate it.

With respect to your commands:

"10.1.23.19: Comes up and binds to the given IP. "ip-10-1-23-19: Fails to resolve ip-10-1-23-19, and then fails to start $(hostname -i):

ERROR: Parameter [fe80::42:f5ff:feb0:2cb1%docker0]does not start with --

"$(hostname -i)":

java.net.UnknownHostException: no such interface eth0 fe80::42:f5ff:feb0:2cb1%docker0 fe80::707a:26ff:feb3:dbb1%spartan fe80::8045:21ff:fe59:a821%veth6d620c6 fe80::a8f5:c6ff:fee7:3af3%veth8af0e4f 10.1.23.19 172.17.0.1 198.51.100.1 198.51.100.2 198.51.100.3

"$LIPPROCESS_IP": Starts up and binds to 198.51.100.1.

The thing is, I have no trouble starting elasticsearch:latest itself in DC/OS. It binds to 198.51.100.1 and starts, much the same as if I hadn't provided the --default.network.publish_host argument at all. My hope was that your package would help with mesos/elasticsearch-scheduler, which is the part having a bad time.

Regarding an existing env, here's the output of docker inspect --format '{{ .Config.Env }}' 7ef131bf3c5a | tr ' ' '\n' on the Universe-provided weavescope-probe container:

[MARATHON_APP_LABEL_DCOS_PACKAGE_SOURCE=https://universe.mesosphere.com/repo
MARATHON_APP_VERSION=2016-05-24T19:28:35.443Z
HOST=10.1.23.19
MARATHON_APP_RESOURCE_CPUS=0.05
MARATHON_APP_LABEL_DCOS_PACKAGE_REGISTRY_VERSION=2.0
PORT_10102=18179
MARATHON_APP_LABEL_DCOS_PACKAGE_RELEASE=1
MARATHON_APP_DOCKER_IMAGE=weaveworks/scope:0.15.0
MARATHON_APP_LABEL_DCOS_PACKAGE_NAME=weavescope-probe
MARATHON_APP_LABEL_DCOS_PACKAGE_VERSION=0.15.0
MESOS_TASK_ID=weavescope-probe.f4fc1fba-21e5-11e6-b902-e6205eb290e4
PORT=18179
MARATHON_APP_RESOURCE_MEM=256.0
PORTS=18179
MARATHON_APP_LABEL_DCOS_PACKAGE_IS_FRAMEWORK=true
MARATHON_APP_RESOURCE_DISK=0.0
MARATHON_APP_LABELS=DCOS_PACKAGE_RELEASE
DCOS_PACKAGE_SOURCE
DCOS_PACKAGE_REGISTRY_VERSION
DCOS_PACKAGE_VERSION
DCOS_PACKAGE_NAME
DCOS_PACKAGE_IS_FRAMEWORK
MARATHON_APP_ID=/weavescope-probe
PORT0=18179
LIBPROCESS_IP=10.1.23.19

jbirch avatar May 29 '16 21:05 jbirch

Hey @jbirch, no problem :-) Please try ${ENV} instead of $ENV:

$ docker run -it --net=host elasticsearch:latest --default.network.publish_host=${LIBPROCESS_IP}
$ docker run -it --net=host elasticsearch:latest --default.network.publish_host=${HOST}

These two ENV variables should work:

HOST=10.1.23.19
LIBPROCESS_IP=10.1.23.19

jstabenow avatar May 29 '16 21:05 jstabenow

Ah sorry ... that can't work, because those ENV variables are created by Mesos and so aren't set when you run the container manually ;-) Please try my ES package again and replace ${LIBPROCESS_IP} with ${HOST}. They were supposed to be the same, though. Strange....

(screenshot: 2016-05-30 00:14)

jstabenow avatar May 29 '16 22:05 jstabenow

Hi all. Thanks @jstabenow for continuing to help out on this. To answer a previous question:

  • The executors are Elasticsearch itself, so they obtain their IP address according to the Elasticsearch code. AFAIK it's a typical Java InetAddress call, which gets the first available adapter.

Remember that you can pass your own settings file and that the ES containers can be overridden, so I would oppose any core code changes that could be achieved that way instead.
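For example, a one-line elasticsearch.yml should be enough to pin the publish host, and the settings-file option can point at it (a sketch only; 10.1.23.19 is just the address from the logs above, substitute your own):

$ echo 'network.publish_host: 10.1.23.19' > elasticsearch.yml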

philwinder avatar May 30 '16 07:05 philwinder

Hey @philwinder No problem. I will close my PR.

jstabenow avatar May 30 '16 14:05 jstabenow

Hi @philwinder,

We still have the case where mesos/elasticsearch-scheduler, whether installed via Universe or via the instructions at https://mesos-elasticsearch.readthedocs.io/en/latest/#how-to-install-on-marathon, fails to work out of the box, while mesos/elasticsearch itself does work. It looks like this might be limited to the default resolver settings you get when you bring up the world in AWS, but I suspect (with no hard data) that it's a common configuration.

Note that the thing that fails to do the binding is https://github.com/mesos/elasticsearch/blob/1.0.1/commons/src/main/java/org/apache/mesos/elasticsearch/common/util/NetworkUtils.java:30, not Elasticsearch itself.
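For anyone else who lands here: the usual stopgap when the instance hostname doesn't resolve is to pin it in /etc/hosts on the agent (a sketch; taking the first address from hostname -I is my assumption, adjust as needed):

$ echo "$(hostname -I | awk '{print $1}') $(hostname)" | sudo tee -a /etc/hosts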

Given the myriad deployment options for the underlying platform that mesos/elasticsearch-scheduler can run on, I don't want to ask anyone to be in the business of making specific changes to support one particular option when it works in general.

Caveat here being that maybe it's actually totally fine and my environment is just screwed up :)

jbirch avatar May 30 '16 23:05 jbirch

@jbirch I did all my manual testing on AWS, so I'm surprised there's a problem here. But I used vanilla Mesos, not DCOS, so I assume it's some difference there.

Can you post the log that is showing the error? That might help decide what to do.

Thanks, Phil

philwinder avatar May 31 '16 15:05 philwinder

I'm almost certain it's an issue on our end, and isn't indicative of the package itself generally "not working".

I would expect something like dig -tANY $(hostname) @169.254.169.253 +short to work out-of-the-box on any AWS instance with DNS enabled in the VPC. In our case, it doesn't, and I think that's why we eventually fail to run mesos/elasticsearch-scheduler (I'd suspect the default resolver of 198.51.100.1 eventually chains up to it).
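For the chaining suspicion, the check on my side is just what the agent's resolv.conf points at, and whether that suspected default resolver answers for the hostname at all:

$ cat /etc/resolv.conf
$ dig -tANY $(hostname) @198.51.100.1 +short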

Tentatively, let's call this one a layer-8 problem and I'll try to get things shored up on our end. It really does look more like "DNS isn't 100%" than "mesos/elasticsearch-scheduler has a bug".

jbirch avatar May 31 '16 16:05 jbirch