Can't create a Docker Swarm Cluster

Open aksakalli opened this issue 6 years ago • 29 comments

  • Docker version 17.05.0-ce (for arm)
  • EMQ Version v2.2-rc.1

Hello,

I am running Docker in swarm mode and want to deploy an MQTT cluster. I decided to create one master instance that the other replicated instances can join. Here is the compose file I wrote for this:

version: "3"

services:
  emq-master:
    image: aksakalli/rpi-emq
    environment:
      - EMQ_HOST=emq-master
      - EMQ_NAME=master
      - EMQ_NODE__COOKIE=ef16498f66804df1cc6172f6996d5492
      - EMQ_NODE__NAME=master@emq-master
  emq-worker:
    image: aksakalli/rpi-emq
    depends_on: 
     - emq-master
    ports:
      - 18083:18083
      - 1883:1883
    deploy:
      replicas: 2
    environment:
      - EMQ_JOIN_CLUSTER=master@emq-master
      - EMQ_NODE__COOKIE=ef16498f66804df1cc6172f6996d5492

(I am using my own image for Raspberry Pi; it is basically the same as emqtt/emq-docker but compiled for ARM.)

When I deploy this stack, I get the following log for the emq-master container:

starting emqttd on node 'master@emq-master'
emqttd ctl is starting...[ok]
emqttd hook is starting...[ok]
emqttd router is starting...[ok]
emqttd pubsub is starting...[ok]
emqttd stats is starting...[ok]
emqttd metrics is starting...[ok]
emqttd pooler is starting...[ok]
emqttd trace is starting...[ok]
emqttd client manager is starting...[ok]
emqttd session manager is starting...[ok]
emqttd session supervisor is starting...[ok]
emqttd wsclient supervisor is starting...[ok]
emqttd broker is starting...[ok]
emqttd alarm is starting...[ok]
emqttd mod supervisor is starting...[ok]
emqttd bridge supervisor is starting...[ok]
emqttd access control is starting...[ok]
emqttd system monitor is starting...[ok]
Load emq_mod_presence module successfully.
Load emq_mod_subscription module successfully.
dashboard:http listen on 0.0.0.0:18083 with 2 acceptors.
mqtt:tcp listen on 127.0.0.1:11883 with 16 acceptors.
mqtt:tcp listen on 0.0.0.0:1883 with 64 acceptors.
mqtt:ws listen on 0.0.0.0:8083 with 16 acceptors.
mqtt:ssl listen on 0.0.0.0:8883 with 32 acceptors.
mqtt:wss listen on 0.0.0.0:8084 with 4 acceptors.
mqtt:api listen on 127.0.0.1:8080 with 4 acceptors.
emqttd 2.2 is running now
Node 'master@emq-master' not responding to pings.
['2017-06-29T09:17:55Z']:waiting emqttd
['2017-06-29T09:17:55Z']:timeout error

Apparently, master@emq-master can not be resolved within the container when I set EMQ_HOST.

I also tried leaving it blank. emqttd then starts with the default IP address (as [email protected]). However, the emq-worker containers cannot join the cluster, even though the emq-master host (FQDN) can be resolved by these containers. The logs from one of the emq-worker containers:

emqttd 2.2 is running now
['2017-06-29T11:18:50Z']:emqttd start
['2017-06-29T11:18:50Z']:emqttd try join master@emq-master
11:18:58.790 [error] ** System running to use fully qualified hostnames **
** Hostname emq-master is illegal **
Failed to join the cluster: {node_not_running,'master@emq-master'}

I connected to one of the worker containers and tried to connect to the master with the hostname again:

root@9df3d2a3d36a:/opt/emqttd/bin# emqttd_ctl cluster join master@emq-master        
Failed to join the cluster: {node_not_running,'master@emq-master'}

And this time, using the ip address, it worked!

root@9df3d2a3d36a:/opt/emqttd# emqttd_ctl cluster join [email protected]  
Join the cluster successfully.
Cluster status: [{running_nodes,['[email protected]','[email protected]']}]

I was planning to set a static ip for my master node, however swarm's overlay network driver does not support it (see Static/Reserved IP addresses for swarm services · Issue #24170 · moby/moby).

How can I create an EMQ cluster deployment properly?

aksakalli avatar Jun 29 '17 11:06 aksakalli

I also tried adding the hostname parameter for emq-master, but that didn't work either:

services:
  emq-master:
    hostname: emq-master
...

aksakalli avatar Jun 29 '17 18:06 aksakalli

@aksakalli The Erlang node name should be Name@Host when clustering, where Host is an IP address or a fully qualified host name. For example:

services:
  emq-master:
    image: aksakalli/rpi-emq
    environment:
      - EMQ_HOST=master.yourdomain
      - EMQ_NAME=emq
      - EMQ_NODE__COOKIE=ef16498f66804df1cc6172f6996d5492
      - [email protected]
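
To make the Name@Host rule concrete, here is a minimal Python sketch that checks whether a node name would pass it. The function name is hypothetical, and treating "contains a dot" as "fully qualified" is an assumption that matches the "Hostname emq-master is illegal" error earlier in the thread, not an official validation routine.

```python
import ipaddress


def is_long_node_name(node: str) -> bool:
    """Return True if `node` looks like Name@Host with Host an IP or a dotted host name."""
    name, sep, host = node.partition("@")
    if not name or not sep or not host:
        return False
    try:
        # A literal IP address is always acceptable as the host part.
        ipaddress.ip_address(host)
        return True
    except ValueError:
        # Assumption: any host name containing a dot counts as fully qualified.
        return "." in host.strip(".")


# Hypothetical sample names, in the style used in this thread.
for candidate in ("master@emq-master", "emq@master.mq.tt", "emq@10.0.0.5"):
    print(candidate, is_long_node_name(candidate))
```

A short host name such as emq-master fails the check, which is why the aliases used later in the thread all contain a dot.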

emqplus avatar Jun 30 '17 01:06 emqplus

@aksakalli, I found the only way I could get the brokers up in clustered mode was if I specified FQDNs. Short hostnames didn't work and since IPs in Docker are dynamic, can't use those either. I assign the EMQ_HOST variable with an FQDN and then set the network alias for that container to the same FQDN. Here's the snippet from my compose file I use to bring up the EMQ services:

services:
  emq_main_1:
    image: emq
    environment:
      EMQ_NAME: emq
      EMQ_HOST: emq_main_1.mq.tt
    networks:
      backend:
        aliases:
          - emq_main_1.mq.tt
  emq_main_2:
    image: emq
    environment:
      EMQ_NAME: emq
      EMQ_HOST: emq_main_1.mq.tt
      EMQ_JOIN_CLUSTER: emq@emq_main_1.mq.tt
    networks:
      backend:
        aliases:
          - emq_main_2.mq.tt

MrOwen avatar Jul 05 '17 16:07 MrOwen

@MrOwen thank you very much, it works with network aliases!

One thing to point out: emq_main_2's EMQ_HOST should be emq_main_2.mq.tt in your snippet.

Here is my compose file:

version: "3"

services:
  emq-master:
    image: emq
    environment:
      - "EMQ_NAME=emq"
      - "EMQ_HOST=master.mq.tt"
      - "EMQ_NODE__COOKIE=ef16498f66804df1cc6172f6996d5492"
    networks:
      emq-cluster:
        aliases:
          - master.mq.tt
    ports:
      - 18083:18083
      - 1883:1883
  emq-worker:
    image: emq
    environment:
      - "[email protected]"
      - "EMQ_NODE__COOKIE=ef16498f66804df1cc6172f6996d5492"
    depends_on:
     - emq-master
    networks:
      emq-cluster:
    deploy:
      replicas: 2

networks:
  emq-cluster:

Now I can run my cluster with 3 instances, and it works fine:

[screenshot: emq]

Now I publish the cluster from the master instance.

[screenshot: emq-docker 1]

My questions are:

  1. I publish everything from the master because I don't want the load balancer to route requests to the workers before they join the cluster. Is this the right approach? How could I improve this for high availability?
  2. Since I have the dashboard from emq-master, do I need to load all default modules for emq-worker? Can I add EMQ_LOADED_PLUGINS="" variable for emq-worker?

aksakalli avatar Jul 05 '17 19:07 aksakalli

We have a script hook in https://github.com/emqtt/emq-docker/blob/master/start.sh#L151

You could create this script and put your cluster handling in it.

vowstar avatar Jul 10 '17 02:07 vowstar

@aksakalli

  1. It's a fact of life that some clients will need to reconnect or wait to connect; there's no way to avoid this. With this approach, only the master ever handles connections and sessions from external clients, of which there are surely many more than those connecting from inside the cluster. That mostly negates the main benefit of clustering (spreading the load) in the first place, I'd think...
  2. I've found (as I'm sure you have by now) that dashboards only show information about their own instance; they do not reflect the whole cluster...

On a general note, this clustering method is still weak when the master is unavailable at the moment a worker connects. Something that constantly re-attaches workers to the master (or a completely different approach) would be needed.
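
As a rough illustration of the startup half of that problem, a worker could retry the join with a growing delay instead of trying once. This is a sketch under stated assumptions: join_with_retry and ctl_join are hypothetical helpers, the success check relies on the "Join the cluster successfully." message shown earlier in this thread, and it does nothing about a master that disappears later.

```python
import subprocess
import time
from typing import Callable


def join_with_retry(try_join: Callable[[], bool], attempts: int = 5,
                    delay: float = 1.0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call try_join until it succeeds or attempts run out, doubling the delay each time."""
    for attempt in range(attempts):
        if try_join():
            return True
        if attempt < attempts - 1:
            # Wait before the next attempt; no sleep after the final failure.
            sleep(delay)
            delay *= 2
    return False


def ctl_join(node: str) -> bool:
    """Hypothetical wrapper around the emqttd_ctl call used earlier in this thread."""
    result = subprocess.run(["emqttd_ctl", "cluster", "join", node],
                            capture_output=True, text=True)
    return "successfully" in result.stdout
```

Inside a worker container this could be called as join_with_retry(lambda: ctl_join("emq@master.mq.tt")), where the node name is whatever the master was started with.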

je-al avatar Jul 20 '17 15:07 je-al

@aksakalli

Please have a look at the 2.3 beta version of EMQ. It adds autodiscovery. http://emqttd-docs.readthedocs.io/en/latest/config.html#emq-cluster https://github.com/emqtt/emqttd/blob/v2.3-beta.1/etc/emq.conf#L12

I tried both multicast and etcd, and they both work (had to manually create the node dir for etcd).

Just change ENV EMQ_VERSION=v2.3-beta.1 in the Dockerfile

and then start the containers with the following arguments:

Etcd:

# Create '/emq/emq/nodes' directory in your Etcd cluster. Python example using python-etcd:
>>> import etcd
>>> c = etcd.Client(host='ETCD_HOST', port=2379)
>>> c.write('/emq/emq/nodes', None, dir=True)


docker run --rm -ti \
    -p 18083:18083 \
    -p 1883:1883 \
    -p 8083:8083 \
    --env "EMQ_CLUSTER__NAME=emq" \
    --env "EMQ_CLUSTER__DISCOVERY=etcd" \
    --env "EMQ_CLUSTER__AUTOHEAL=on" \
    --env "EMQ_CLUSTER__AUTOCLEAN=3m" \
    --env "EMQ_CLUSTER__ETCD__SERVER=http:\/\/ETCD_HOST:2379" \
    --env "EMQ_CLUSTER__ETCD__PREFIX=emq" \
    --env "EMQ_CLUSTER__ETCD__NODE_TTL=1m" \
    YOUR-REPO-HERE/emq:2.3-beta

# Run the same, but skip/change the ports for consecutive nodes.

Multicast:

docker run --rm -ti \
    -p 18083:18083 \
    -p 1883:1883 \
    -p 8083:8083 \
    --env "EMQ_CLUSTER__NAME=emq" \
    --env "EMQ_CLUSTER__DISCOVERY=mcast" \
    --env "EMQ_CLUSTER__AUTOHEAL=on" \
    --env "EMQ_CLUSTER__AUTOCLEAN=3m" \
    --env "EMQ_CLUSTER__MCAST__ADDR=239.192.0.1" \
    --env "EMQ_CLUSTER__MCAST__PORTS=4369,4370" \
    --env "EMQ_CLUSTER__MCAST__IFACE=0.0.0.0" \
    --env "EMQ_CLUSTER__MCAST__TTL=255" \
    --env "EMQ_CLUSTER__MCAST__LOOP=on" \
    YOUR-REPO-HERE/emq:2.3-beta

# Run the same, but skip/change the ports for consecutive nodes.

I hope the above helps.

vivobg avatar Jul 29 '17 13:07 vivobg

I had to make a few tweaks to bring a cluster up using DNS auto discovery and docker swarm:

version: "3"
services:
  mqtt:
    networks:
      proxy:
      mqtt:
      default:
        aliases:
          - mymqtt
    deploy:
      replicas: 12
    ports:
      - 1883:1883 # MQTT
    image: chrisns/emq:v2.3-beta.3-hacked
    environment:
      - EMQ_CLUSTER__DNS__NAME=tasks.mymqtt
      - EMQ_NAME=emq
      - EMQ_CLUSTER__DISCOVERY=dns
      - EMQ_CLUSTER__AUTOHEAL=on
      - EMQ_CLUSTER__AUTOCLEAN=30s
      - EMQ_CLUSTER__DNS__APP=emq
networks:
  default:
    external: false
  mqtt:
    external: true
  proxy:
    external: true

docker stack deploy -c docker-compose.yml mqtt

The main thing that wasn't working and needed to be hacked was the IP address determination in start.sh, which is way too simple. My script figures out which IP the container has on the aliased network and uses that for self-identification and communication between the containers, though my solution is a bit specific to DNS-based discovery.

Aside from that, it's annoying that the default emq.conf has lines commented out, so to keep the nice env var replacer in the built-in start.sh working you have to remove the #'s.
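
For readers unfamiliar with the env var replacer, the naming convention visible in this thread (for example EMQ_CLUSTER__DISCOVERY next to cluster.discovery) can be sketched as follows. This is an assumption inferred from those names, not a copy of what start.sh does, and env_to_conf_key is a hypothetical helper.

```python
def env_to_conf_key(var_name: str, prefix: str = "EMQ_") -> str:
    """Map an environment variable name to the config key it appears to target."""
    if not var_name.startswith(prefix):
        raise ValueError(f"{var_name!r} does not start with {prefix!r}")
    # Assumed convention: drop the prefix, lowercase, and turn "__" into ".".
    return var_name[len(prefix):].lower().replace("__", ".")


for name in ("EMQ_CLUSTER__DISCOVERY", "EMQ_CLUSTER__ETCD__NODE_TTL", "EMQ_NODE__COOKIE"):
    print(name, "->", env_to_conf_key(name))
```

Single underscores are kept, so a key such as node_ttl survives the conversion. Variables like EMQ_NAME and EMQ_HOST do not follow this pattern and are presumably handled separately.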

In other related news I built a thing that automagically builds+pushes docker images for all the releases and a -hacked with my patches https://hub.docker.com/r/chrisns/emq/tags/

Code is here: https://github.com/chrisns/docker-emq

This is super self serving and not really sensible enough for me to make a PR with any of it, but hopefully sharing my solution/hacks will help someone :)

chrisns avatar Aug 21 '17 16:08 chrisns

@chrisns I had run into the IP issue before. I settled on assigning a specific subnet to the overlay network used for the cluster, plus a custom variable that signals its prefix so it can be matched against the addresses available inside the container (which is way more complicated). But yeah, something that helps pick the "right" network to take the node "name" from is needed. Your approach actually looks fine, except that an extra variable (not tied to a specific clustering solution) might be needed. The replacer works fine for commented lines; it's just that the regex does not match whitespace correctly. I submitted a patch for it, but it got rolled back later on...
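
The subnet-matching idea can be sketched in a few lines of Python. This is only an illustration: pick_cluster_ip is a hypothetical helper, the addresses and the 10.0.9.0/24 subnet are made up, and how the container's addresses are collected is left out.

```python
import ipaddress
from typing import Iterable, Optional


def pick_cluster_ip(addresses: Iterable[str], cluster_subnet: str) -> Optional[str]:
    """Return the first address that lies inside the cluster's overlay subnet."""
    network = ipaddress.ip_network(cluster_subnet)
    for addr in addresses:
        if ipaddress.ip_address(addr) in network:
            return addr
    return None


# Hypothetical container with three interfaces; only one is on the cluster network.
print(pick_cluster_ip(["10.255.0.7", "10.0.9.4", "172.18.0.3"], "10.0.9.0/24"))
```

The chosen address would then be used as the host part of the node name, so that every node advertises an IP the other cluster members can actually reach.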

je-al avatar Aug 31 '17 00:08 je-al

I eventually decided to abandon work on this for now. If the cluster comes up too fast, the nodes don't discover each other, or worse, they discover only some of the other nodes, so you can end up with several separate clusters. I found that spinning up 12 containers could easily result in a cluster of 4, another cluster of 5, and then 3 unclustered nodes, which is pretty annoying/pointless. I really hoped the auto discovery would run all the time, not just at startup.
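
One way to at least detect such a split is to parse the "Cluster status" line printed by the ctl tool and compare the node count with the expected number of replicas. A minimal sketch, assuming the output format shown elsewhere in this thread; the function names and the sample node names are hypothetical.

```python
import re
from typing import List


def running_nodes(status_line: str) -> List[str]:
    """Extract node names from a 'Cluster status: [{running_nodes,[...]}]' line."""
    match = re.search(r"running_nodes,\[(.*?)\]", status_line)
    if not match:
        return []
    # Node names are printed as single-quoted Erlang atoms.
    return re.findall(r"'([^']+)'", match.group(1))


def is_fully_clustered(status_line: str, expected: int) -> bool:
    """True when the node sees exactly the expected number of running nodes."""
    return len(running_nodes(status_line)) == expected


# Hypothetical status line for a two-node cluster.
sample = "Cluster status: [{running_nodes,['emq@10.0.0.2','emq@10.0.0.3']}]"
print(running_nodes(sample), is_fully_clustered(sample, 12))
```

A check like this could run in each container after startup and trigger a restart or a manual join when the count is too low.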

chrisns avatar Aug 31 '17 10:08 chrisns

Does anyone have any news about this issue?

RaymondMouthaan avatar Feb 04 '18 18:02 RaymondMouthaan

This does not seem to work. If you provide the DNS name, it resolves to another IP, most likely the load balancer's, and not the node IP.

What I did was mount docker.sock, get the IPs from there using Python, and use cluster.sh to try to join those IPs manually.

I hope in the future the developers will consider a viable solution for Docker, because mcast does not work with overlay and also etcd is not a good solution.

purplesrl avatar Oct 24 '18 11:10 purplesrl

@purplesrl Can you share your solution? I'm looking for a good solution that allows me to use dockerized emq clusters in Amazon ECS.

tomaszwostal avatar Nov 07 '18 06:11 tomaszwostal

Has someone figured out a way to create a docker swarm/docker-compose cluster in emqx version 3? I have tried some of the suggested ways here and haven't found a solution yet.

optionsome avatar Dec 13 '18 16:12 optionsome

@optionsome may I point you to -> https://github.com/emqx/emqx-docker/pull/91#issue-233811388 ?

RaymondMouthaan avatar Dec 13 '18 16:12 RaymondMouthaan

@purplesrl Can you share your solution? I'm looking for a good solution that allows me to use dockerized emq clusters in Amazon ECS.

@tomaszwostal Unfortunately not; I developed the code at work. But I outlined the steps: the idea is to find the IPs and then join the nodes manually. On Docker the automatic way does not work, mainly because docker swarm provides a load-balancer IP while emqx requires the actual IP of the node.
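
A minimal sketch of the "find the IPs, then join manually" idea follows. It swaps the docker.sock lookup for a DNS lookup of tasks.<service>, the name Swarm's embedded DNS answers with the individual task IPs rather than the load-balancer IP. The function names, the service name, and the node name "emq" are assumptions; this is not the code described above.

```python
import socket
from typing import List


def task_ips(service: str) -> List[str]:
    """Resolve tasks.<service> to the IPv4 addresses of the individual tasks."""
    infos = socket.getaddrinfo(f"tasks.{service}", None,
                               family=socket.AF_INET, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def join_targets(ips: List[str], own_ip: str, name: str = "emq") -> List[str]:
    """Build Name@IP node names for every task except this one."""
    return [f"{name}@{ip}" for ip in ips if ip != own_ip]


# Hypothetical addresses; inside a Swarm task one would pass task_ips("mymqtt") instead.
print(join_targets(["10.0.0.2", "10.0.0.3", "10.0.0.4"], own_ip="10.0.0.2"))
```

Each resulting name can then be passed to the cluster join command of the ctl tool until one join succeeds.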

purplesrl avatar Dec 13 '18 16:12 purplesrl

@RaymondMouthaan thanks a lot! I was able to get the clustering to work. I don't know what my problem was earlier as what I was trying was really similar to your solution. Was just missing the hostname and volume definitions.

optionsome avatar Dec 13 '18 16:12 optionsome

@optionsome, good to hear you made it work 👍🏽. One note on this: when emqx-worker starts faster than emqx-master, you might end up with two individual emqx instances instead of a cluster. Solution: just restart the worker container.

RaymondMouthaan avatar Dec 13 '18 16:12 RaymondMouthaan

@RaymondMouthaan I copied your example docker compose file and ran it, but it doesn't work; the nodes don't cluster. I restarted the worker container and it's still the same. Did I do anything wrong? Looking forward to your reply.

Rebellioncry avatar Nov 19 '19 02:11 Rebellioncry

@Rebellioncry, apologies, but I have not been using emqx as an MQTT broker for a while now. @zhanghongtong might be able to help you.

RaymondMouthaan avatar Nov 19 '19 05:11 RaymondMouthaan

@RaymondMouthaan Thank you for your reply! @zhanghongtong How do I create a docker swarm cluster now? Can you give me an example? Looking forward to your reply.

Rebellioncry avatar Nov 19 '19 07:11 Rebellioncry

@Rebellioncry Hi, an example docker-compose.yaml is as follows:

version: '3'

services:
  emqx1:
    image: emqx/emqx:v3.2.5
    environment:
    - "EMQX_NAME=emqx"
    - "EMQX_HOST=node1.emqx.io"
    - "EMQX_CLUSTER__DISCOVERY=static"
    - "[email protected], [email protected]"
    networks:
      emqx-net:
        aliases:
        - node1.emqx.io
  
  emqx2:
    image: emqx/emqx:v3.2.5
    environment:
    - "EMQX_NAME=emqx2"
    - "EMQX_HOST=node2.emqx.io"
    - "EMQX_CLUSTER__DISCOVERY=static"
    - "[email protected], [email protected]"
    networks:
      emqx-net:
        aliases:
        - node2.emqx.io

networks:
  emqx-net:

Execute docker-compose up:

$ docker-compose up
Creating tmp_emqx1_1 ... done
Creating tmp_emqx2_1 ... done
Attaching to tmp_emqx2_1, tmp_emqx1_1
emqx1_1  | node.max_ports=1048576
emqx2_1  | node.max_ports=1048576
emqx2_1  | listener.tcp.external.acceptors=64
emqx2_1  | listener.ssl.external.acceptors=32
emqx2_1  | node.process_limit=2097152
emqx2_1  | node.max_ets_tables=2097152
emqx2_1  | cluster.discovery=static
emqx2_1  | cluster.discovery=static
emqx2_1  | listener.ws.external.acceptors=16
emqx2_1  | [email protected]
emqx2_1  | [email protected], [email protected]
emqx2_1  | [email protected], [email protected]
emqx1_1  | listener.tcp.external.acceptors=64
emqx1_1  | listener.ssl.external.acceptors=32
emqx1_1  | node.process_limit=2097152
emqx1_1  | node.max_ets_tables=2097152
emqx1_1  | cluster.discovery=static
emqx1_1  | cluster.discovery=static
emqx1_1  | listener.ws.external.acceptors=16
emqx1_1  | [email protected]
emqx1_1  | [email protected], [email protected]
emqx1_1  | [email protected], [email protected]
emqx2_1  | emqx v3.2.5 is started successfully!
emqx1_1  | emqx v3.2.5 is started successfully!
emqx2_1  | 2019-11-19 13:14:57.259 [critical] [EMQ X] emqx shutdown for join
emqx2_1  | ['2019-11-19T13:15:00Z']:emqx start
emqx1_1  | ['2019-11-19T13:15:01Z']:emqx start

$ docker exec -it tmp_emqx1_1 sh -c "emqx_ctl cluster status"
Cluster status: [{running_nodes,['[email protected]','[email protected]']}]

Rory-Z avatar Nov 19 '19 13:11 Rory-Z

@zhanghongtong thanks a lot! Your example works well!

Rebellioncry avatar Nov 20 '19 07:11 Rebellioncry

@Rebellioncry You are welcome :)

Rory-Z avatar Nov 20 '19 08:11 Rory-Z

@aaamitsingh I'm sorry we don't have an example yet

Rory-Z avatar Jan 16 '20 01:01 Rory-Z

@zhanghongtong Is autoclustering still only possible with a static node list?

renatomotorline avatar Apr 17 '20 15:04 renatomotorline

Hi @renatomotorline, you can refer to our documentation

Rory-Z avatar Apr 20 '20 01:04 Rory-Z

@zhanghongtong I read the documentation, but I could only get the cluster to work with a static node list, like the example you posted above. Do you have any example with DNS, multicast, or etcd?

renatomotorline avatar Apr 20 '20 10:04 renatomotorline

@renatomotorline Sorry, we don't have an example of DNS, multicast and etcd clusters

Rory-Z avatar Apr 21 '20 01:04 Rory-Z