Clustering nodes that are on different servers
I'm using this docker image and trying to cluster 2 nodes that run on different servers, each with its own public IP. As a test, I successfully clustered 2 docker containers on the same machine.
However, when I try to define a FQDN in ERLANG_NODE_ARG, I get an error that I don't know how to overcome.
This container starts without errors (I'm skipping unrelated lines):
services:
  ej1container:
    hostname: ej1container # containername works here too
    environment:
      - ERLANG_NODE_ARG=ej1@ej1container
This setup gives me an error
services:
  ej1container:
    hostname: ej1container # containername works here too
    environment:
      - [email protected]
It looks like the container starts normally but when I do
docker exec ej1container ejabberdctl status
I get
Failed RPC connection to the node '[email protected]': nodedown
I already pointed the A record of subdomain.domain.com to the public IP of the VPS where this is running.
There was a similar issue https://github.com/processone/docker-ejabberd/issues/106 but I don't see how the FQDN was integrated and what the solution was.
Any help would be much appreciated.
Update: While the main node is running on Server A as ej1@ej1container, I tried to add Server B to it to form a cluster and ran into these issues:
- When I use an FQDN, I get
ej3con | :> ejabberdctl join_cluster [email protected]
ej3con |
ej3con | 21:31:47.574 [error] ** System NOT running to use fully qualified hostnames **
ej3con | ** Hostname subdomain.domain.com is illegal **
ej3con |
ej3con | Error: error
ej3con | Error: "This node cannot reach that node."
ej3con | :> FAILURE in command 'join_cluster [email protected]' !!! Stopping ejabberd...
- When I use an IP instead, I get
ej3con | :> ejabberdctl join_cluster [email protected]
ej3con |
ej3con | 20:17:39.761 [error] ** System NOT running to use fully qualified hostnames **
ej3con | ** Hostname xxx.xxx.xxx.xx is illegal **
ej3con |
ej3con | Error: error
ej3con | Error: "This node cannot reach that node."
ej3con | :> FAILURE in command 'join_cluster [email protected]' !!! Stopping ejabberd...
ej3con | [os_mon] memory supervisor port (memsup): Erlang has closed
- [email protected]
That environment variable is read by the ejabberdctl script and passed to the erl virtual machine as the argument -sname (or -name when the value contains a dot, as in a name with subdomains). As a result, the Erlang virtual machine names itself [email protected].
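To make that rule concrete, here is a minimal sketch of the choice between -sname and -name. The function name node_name_flag is hypothetical, and checking only the host part after the @ is my assumption; this illustrates the rule and is not the actual code of the ejabberdctl script.

```shell
# Hypothetical helper, for illustration only (not taken from ejabberdctl).
# Prints the erl flag that matches the given node name:
#   -name   when the host part (after the @) contains a dot
#   -sname  otherwise
node_name_flag() {
    host="${1#*@}"    # strip everything up to and including the first @
    case "$host" in
        *.*) echo "-name" ;;
        *)   echo "-sname" ;;
    esac
}

node_name_flag "ej1@ej1container"             # short node name
node_name_flag "ej1@ej1container.domain.com"  # long node name
```

A node started with one flag only talks to nodes that use the same kind of name, which is why the choice matters for clustering.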
docker exec ej1container ejabberdctl status
Failed RPC connection to the node '[email protected]': nodedown
I get that same problem with a similar compose file:
version: '3.7'
services:
  main:
    image: ghcr.io/badlop/ejabberd:dependabot
    container_name: ejabberd
    hostname: ej1container
    environment:
      - [email protected]
      - ERLANG_COOKIE=dummycookie123
The solution in my case is to add subdomain.domain.com to /etc/hosts inside the container. That way ejabberdctl is able to connect correctly to the running node and get the status.
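One possible way to add that entry is sketched below. The function name add_host_entry is hypothetical, and the IP address is a placeholder: which address to map the name to is not stated above, so treat 127.0.0.1 in the comment as an assumption. The function would have to run inside the container as root (for example through docker exec -u root).

```shell
# Hypothetical helper, for illustration only.
# Appends "IP NAME" to a hosts file unless NAME already appears in it
# as a whole word. The third argument defaults to /etc/hosts.
add_host_entry() {
    ip="$1"
    name="$2"
    file="${3:-/etc/hosts}"
    # -F: fixed string (dots are literal), -w: whole word, -q: quiet
    grep -qwF -- "$name" "$file" 2>/dev/null ||
        printf '%s %s\n' "$ip" "$name" >> "$file"
}

# Example call (the IP is an assumption, adjust it to your setup):
# add_host_entry 127.0.0.1 subdomain.domain.com
```

Because the function skips names that are already present, calling it on every container start does not pile up duplicate lines.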
ERLANG_NODE_ARG=ej1@ej1container ejabberdctl join_cluster [email protected]
** System NOT running to use fully qualified hostnames **
Right, you used the Erlang short node name ej1container, so you cannot later use a long node name like sub.domains.
Either use:
ERLANG_NODE_ARG=ej1@ej1container ejabberdctl join_cluster ej1@ej1container
If you use this on different machines, make sure the second one knows where to find ej1container (by adding it to /etc/hosts, for example).
Or use:
[email protected] ejabberdctl join_cluster [email protected]
In that case, make sure Erlang can resolve what ej1container.domain.com points to.
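The two options above boil down to one rule: the local node name and the node name given to join_cluster must both be short or both be long. A minimal pre-flight check could look like the sketch below. The function names name_type and same_name_type are hypothetical and not part of ejabberdctl, and treating "host part contains a dot" as the definition of a long name is my assumption.

```shell
# Hypothetical pre-flight check, for illustration only.
# A node name counts as "long" when the host part after the @ has a dot.
name_type() {
    case "${1#*@}" in
        *.*) echo long ;;
        *)   echo short ;;
    esac
}

# Succeeds (exit status 0) only when both node names are the same type.
same_name_type() {
    [ "$(name_type "$1")" = "$(name_type "$2")" ]
}

if same_name_type "ej1@ej1container" "ej1@ej1container.domain.com"; then
    echo "ok: name types match"
else
    echo "mismatch: use short names or long names on both nodes"
fi
```

With the two example names, the check reports a mismatch, which corresponds to the "System NOT running to use fully qualified hostnames" error quoted earlier.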
Thank you for your reply, @badlop. Here is what worked for me to move past the "Failed RPC connection to the node '[email protected]': nodedown" message and get a positive STATUS message. In the docker compose file, I used:
services:
  ejabberd:
    image: ejabberd/ecs:24.07
    container_name: ejabberd
    hostname: subdomain.domain.com
    environment:
      - CTL_ON_START=status
      - ERLANG_COOKIE=[removed]
      - [email protected]
However, when I try to connect from Server B to [email protected], which is on Server A, I get
Error: error
Error: "This node cannot reach that node."
When I
docker exec ejabberd bin/ejabberdctl ping [email protected]
from Server B, I get pang.
When I ping Server A from Server B, I can reach it with no issues.
When I
docker exec -u root ejabberd ping subdomain.domain.com
from Server B, Server A is again reachable.
I feel like I'm missing something here. Again, your help is much appreciated.
Hi @badlop, should I re-submit this issue under the issues of https://github.com/processone/ejabberd/ ? I'm wondering whether that repository is monitored more closely and whether issues with the containers should also be submitted there. Thanks.
This is a problem with that container image, so here seems a good place for the issue.
On the other hand, it may be a problem related to Docker and Erlang clustering in general, not only to ejabberd, so it is worth searching for related questions outside ejabberd-specific places.