Clustering nodes that are on different servers

Open etranger7 opened this issue 1 year ago • 5 comments

I'm using this docker image and trying to cluster 2 nodes that are on different servers, therefore 2 different public IPs. Just for testing, I successfully clustered 2 docker containers that are on the same machine.

However, when I try to define a FQDN in ERLANG_NODE_ARG, I get an error that I don't know how to overcome.

This container starts without errors (I'm skipping unrelated lines):

services:
  ej1container:
    hostname: ej1container          # containername works here too
    environment:
      - ERLANG_NODE_ARG=ej1@ej1container

This setup gives me an error:

services:
  ej1container:
    hostname: ej1container          # containername works here too
    environment:
      - [email protected]

The container appears to start normally, but when I run

docker exec ej1container ejabberdctl status

I get

Failed RPC connection to the node '[email protected]': nodedown

I already pointed the A record of subdomain.domain.com to the public IP of the VPS where this is running.

There was a similar issue https://github.com/processone/docker-ejabberd/issues/106 but I don't see how the FQDN was integrated and what the solution was.

Any help would be much appreciated.

etranger7 avatar Sep 12 '24 15:09 etranger7

Update: While the main node is running on Server A as ej1@ej1container, I tried to add Server B to it to form a cluster and ran into these issues:

  • When I use an FQDN, I get
ej3con  | :> ejabberdctl join_cluster [email protected]
ej3con  | 
ej3con  | 21:31:47.574 [error] ** System NOT running to use fully qualified hostnames **
ej3con  | ** Hostname subdomain.domain.com is illegal **
ej3con  | 
ej3con  | Error: error
ej3con  | Error: "This node cannot reach that node."
ej3con  | :> FAILURE in command 'join_cluster [email protected]' !!! Stopping ejabberd...
  • When I use an IP instead, I get
ej3con  | :> ejabberdctl join_cluster [email protected]
ej3con  | 
ej3con  | 20:17:39.761 [error] ** System NOT running to use fully qualified hostnames **
ej3con  | ** Hostname xxx.xxx.xxx.xx is illegal **
ej3con  | 
ej3con  | Error: error
ej3con  | Error: "This node cannot reach that node."
ej3con  | :> FAILURE in command 'join_cluster [email protected]' !!! Stopping ejabberd...
ej3con  | [os_mon] memory supervisor port (memsup): Erlang has closed

etranger7 avatar Sep 12 '24 21:09 etranger7

- [email protected]

That environment variable is read by the ejabberdctl script and passed to the erl virtual machine as the argument -sname (or -name when the value contains a dot, as a domain with subdomains does). As a result, the Erlang virtual machine names itself [email protected].
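
To illustrate that rule, here is a minimal sketch of the choice between the two arguments. It is not the actual code of the ejabberdctl script; node_flag is a hypothetical helper name and the node names are only example values.

```shell
# Sketch only: node_flag is a hypothetical helper, not part of ejabberdctl.
# It prints the erl argument that matches the rule described above:
# a value containing a dot gets -name, anything else gets -sname.
node_flag() {
  case "$1" in
    *.*) echo "-name" ;;
    *)   echo "-sname" ;;
  esac
}

node_flag "ej1@ej1container"           # prints -sname
node_flag "ej1@subdomain.domain.com"   # prints -name
```

A node started with one kind of name only talks to nodes that use the same kind, which is why the choice matters for clustering.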


docker exec ej1container ejabberdctl status
Failed RPC connection to the node '[email protected]': nodedown

I get that same problem with a similar compose file:

version: '3.7'

services:

  main:
    image: ghcr.io/badlop/ejabberd:dependabot
    container_name: ejabberd
    hostname: ej1container
    environment:
      - [email protected]
      - ERLANG_COOKIE=dummycookie123

The solution in my case is to add subdomain.domain.com to /etc/hosts inside the container. That way ejabberdctl is able to connect correctly to the running node and get the status.
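
A minimal sketch of adding such an entry, assuming a POSIX shell. add_host_entry is a hypothetical helper, and 203.0.113.10 is a placeholder address that you would replace with the IP the name should resolve to. The demonstration writes to a scratch file instead of the real /etc/hosts.

```shell
# Sketch only: add_host_entry is a hypothetical helper.
# It appends "IP NAME" to a hosts file unless NAME already appears in it
# (checked as a literal substring).
add_host_entry() {
  hosts_file=$1
  ip=$2
  name=$3
  if ! grep -qF -- "$name" "$hosts_file" 2>/dev/null; then
    printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
  fi
}

# Demonstrate on a scratch file rather than the real /etc/hosts.
scratch=$(mktemp)
add_host_entry "$scratch" 203.0.113.10 subdomain.domain.com
add_host_entry "$scratch" 203.0.113.10 subdomain.domain.com   # second call adds nothing
cat "$scratch"
rm -f "$scratch"
```

Inside the container the target file would be /etc/hosts and the command would have to run as root (docker exec -u root ...). Docker Compose also offers the extra_hosts key, which declares the same kind of entry in the compose file.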


ERLANG_NODE_ARG=ej1@ej1container ejabberdctl join_cluster [email protected]
** System NOT running to use fully qualified hostnames **

Right, you started the node with the short Erlang node name ej1container, so you cannot later use a long node name (one with dots in the host part, like subdomain.domain.com).

Either use:

ERLANG_NODE_ARG=ej1@ej1container ejabberdctl join_cluster ej1@ej1container

If you use this on different machines, make sure the second one knows where to find ej1container (for example, by adding it to /etc/hosts).

Or use:

[email protected] ejabberdctl join_cluster [email protected]

In that case, make sure Erlang can resolve what ej1container.domain.com points to.
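
The short-versus-long mismatch behind the "System NOT running to use fully qualified hostnames" error can be checked before calling join_cluster. This is a minimal sketch, not something ejabberdctl provides: is_long_name and same_name_mode are hypothetical helpers, and the node names are example values.

```shell
# Sketch only: both functions are hypothetical helpers, not part of ejabberdctl.

# Succeeds when the host part (the text after "@") contains a dot.
is_long_name() {
  case "${1#*@}" in
    *.*) return 0 ;;
    *)   return 1 ;;
  esac
}

# Succeeds when both node names are short, or both are long.
same_name_mode() {
  if is_long_name "$1"; then
    is_long_name "$2"
  else
    ! is_long_name "$2"
  fi
}

# Local node name first, node to join second.
if same_name_mode "ej1@ej1container" "ej1@subdomain.domain.com"; then
  echo "ok"
else
  echo "mismatch: short and long node names cannot be mixed"
fi
```

An IP address in the host part also contains dots, so it counts as a long name here, which is consistent with the same error showing up when an IP was used above.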

badlop avatar Sep 20 '24 10:09 badlop

Thank you for your reply, @badlop. Here is what worked for me to get past the "Failed RPC connection to the node '[email protected]': nodedown" message and get a successful status message. In the docker compose file, I used:

services:
  ejabberd:
    image: ejabberd/ecs:24.07
    container_name: ejabberd
    hostname: subdomain.domain.com
    environment:
      - CTL_ON_START=status
      - ERLANG_COOKIE=[removed]
      - [email protected]

However, when I try to connect from Server B to [email protected], which is on Server A, I get:

Error: error
Error: "This node cannot reach that node."

When I run

docker exec ejabberd bin/ejabberdctl ping [email protected]

from Server B, I get pang.

When I ping Server A from Server B, I can reach it with no issues.

When I run

docker exec -u root ejabberd ping subdomain.domain.com

from Server B, Server A is again reachable.

I feel like I'm missing something here. Again, your help is much appreciated.

etranger7 avatar Sep 26 '24 22:09 etranger7

Hi @badlop, should I re-submit this issue under the issues of https://github.com/processone/ejabberd/? I'm wondering whether that repository is more closely monitored and whether issues with the containers should also be submitted there. Thanks.

etranger7 avatar Oct 08 '24 16:10 etranger7

This is a problem with that container image, so this seems a good place for the issue.

On the other hand, it may be a problem related to Docker and Erlang clustering in general, not only ejabberd, so you may also want to search for related questions outside ejabberd-specific places.

badlop avatar Oct 10 '24 19:10 badlop