
Docker 1.12 services - After a Galera scale-down, MaxScale's autodiscovery runs into trouble

Open Franselbaer opened this issue 7 years ago • 8 comments

If you scale the Galera cluster up and then back down, MaxScale's auto-discovery runs into trouble.

The command used in the entrypoint script:

getent hosts tasks.dbcluster
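
For illustration, with three running tasks this would print something like the following (the addresses are invented for the example; getent prints one line per task IP on the overlay network):

10.0.0.5        tasks.dbcluster
10.0.0.6        tasks.dbcluster
10.0.0.7        tasks.dbcluster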

This correctly delivers the N cluster IPs of dbcluster, BUT:

If you do something like:

docker service scale dbcluster=10

and then:

docker service scale dbcluster=5

The instance list:

docker service ps dbcluster

shows something like:

ID                         NAME             IMAGE                    NODE     DESIRED STATE  CURRENT STATE          ERROR
0s4hgq9tm28xmp3padelhq258  dbcluster.1      toughiq/mariadb-cluster  doswa-5  Running        Running 17 hours ago
3f2b2q0rs4i2yzy92ohue7dlq  dbcluster.2      toughiq/mariadb-cluster  doswa-4  Running        Running 17 hours ago
2ks1kl7einrlnbzkh8aayz9oq   \_ dbcluster.2  toughiq/mariadb-cluster  doswa-4  Shutdown       Shutdown 17 hours ago
0xgbr3q3wavzkk5bvagby8xyu  dbcluster.3      toughiq/mariadb-cluster  doswa-4  Running        Running 17 hours ago
bdsbd10u203pjj2kyvawohw23   \_ dbcluster.3  toughiq/mariadb-cluster  doswa-3  Shutdown       Shutdown 17 hours ago
6m92mbed7hrc2w0cnwfn7c66d  dbcluster.4      toughiq/mariadb-cluster  doswa-5  Running        Running 17 hours ago
9ky7bh2wewsqgx0pptzjkpaqm   \_ dbcluster.4  toughiq/mariadb-cluster  doswa-5  Shutdown       Shutdown 17 hours ago
as90l1abljf8seojivtyu265y   \_ dbcluster.4  toughiq/mariadb-cluster  doswa-5  Shutdown       Shutdown 17 hours ago
2ms4ilr6hbh9fovjixc1a0npi  dbcluster.5      toughiq/mariadb-cluster  doswa-5  Shutdown       Shutdown 17 hours ago
aavba7zhv7y9z77vsgyaab03n   \_ dbcluster.5  toughiq/mariadb-cluster  doswa-4  Shutdown       Shutdown 17 hours ago
d1in2lunlab6qfj3p0kbks288  dbcluster.6      toughiq/mariadb-cluster  doswa-4  Shutdown       Shutdown 17 hours ago
btm75qwpa8oi1fg07qkvnpf9t   \_ dbcluster.6  toughiq/mariadb-cluster  doswa-4  Shutdown       Shutdown 17 hours ago
4ymbc2lwzf4dt1o7ooswilyrt  dbcluster.7      toughiq/mariadb-cluster  doswa-3  Running        Running 17 hours ago
c60ahb1mmtbjjzut0z31v2o3v  dbcluster.8      toughiq/mariadb-cluster  doswa-3  Shutdown       Shutdown 17 hours ago
1bk8o6eajfbwz668pkzv629g4   \_ dbcluster.8  toughiq/mariadb-cluster  doswa-5  Shutdown       Shutdown 17 hours ago
dc9j3annf9dn1aueo2n46i9lu  dbcluster.9      toughiq/mariadb-cluster  doswa-5  Shutdown       Shutdown 17 hours ago
5ke252yv31v9rajzsr3x8n9uc  dbcluster.10     toughiq/mariadb-cluster  doswa-4  Shutdown       Shutdown 17 hours ago

And in this case getent delivers 5 cluster IPs, some of them belonging to instances in Shutdown state. Unfortunately, Docker Swarm does not seem to clean up shut-down instances. I'm currently not sure what a good way around this is.
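
One possible workaround would be to probe each discovered IP before handing it to MaxScale, so that addresses of shut-down tasks are filtered out. This is only a sketch: it assumes nc is available in the container and that the nodes listen on port 3306.

# Keep only IPs that actually accept connections on the MariaDB port
for ip in $(getent hosts tasks.dbcluster | awk '{ print $1 }'); do
  if nc -z -w 2 "$ip" 3306; then
    echo "$ip"    # reachable node, safe to put into the MaxScale config
  fi
done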

Franselbaer avatar Oct 29 '16 17:10 Franselbaer

Hi @Franselbaer, I saw similar problems with the cluster discovery itself. Sometimes, if you scale out and scale in repeatedly, the new nodes won't find the existing ones. Or the cluster might break apart, since not every node can reach all the other members. I am not sure whether the problem is the Swarm DNS or the networking itself. Sometimes I had the overlay network attached to all nodes, but no communication over this net was possible.

In my opinion this problem is caused by Swarm and its DNS itself. The only way to prevent it would be to establish some kind of alternative service discovery. But that would make the whole idea obsolete, since DNS and service discovery should be an environmental feature, provided by the cluster management and just consumed by the clients/containers.

Which Docker version did you use when getting your results? I didn't try the current 1.12.3 version yet to see if this behavior still exists.

toughIQ avatar Oct 29 '16 18:10 toughIQ

I am facing the same problem on 1.12.3

joneschan avatar Dec 08 '16 04:12 joneschan

I've tested this only with 1.12.3 because I started out with Docker at that version.

Franselbaer avatar Dec 08 '16 10:12 Franselbaer

@Franselbaer see https://github.com/docker/swarmkit/issues/1372

danfromtitan avatar May 19 '17 13:05 danfromtitan

It's an old bug, but it causes some critical failures. It is one of the bugs that keeps you from using Docker for live services.

  1. After deleting or renaming a network in the Swarm cluster, you may find that the new network doesn't work properly. Containers in the Created or Dead state still hold references to the old network, so the old network is preserved (see the cleanup sketch after this list).

  2. The leftover instances slowly deplete the resources of your machine.
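
A cleanup along these lines may help (a sketch only; the network name mynet is a placeholder, and docker container prune requires Docker 1.13+):

# On each node: remove stopped/dead containers that still hold the network
docker container prune -f
# Then drop and recreate the overlay network
docker network rm mynet
docker network create --driver overlay mynet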

yunghoy avatar Nov 06 '17 08:11 yunghoy

Auto-discovery inside Swarm doesn't seem to work at all when the stack starts: there is a race condition between the cluster coming up and MaxScale starting. It took me a while to figure this out.
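
A minimal way around the race would be an entrypoint wrapper that blocks MaxScale until the Galera service is resolvable and reachable. This is only a sketch: the service name dbcluster is taken from this thread, nc must exist in the image, and the exact maxscale flags depend on the version.

# Wait until the service name resolves at all
until getent hosts tasks.dbcluster > /dev/null; do
  echo "waiting for tasks.dbcluster to resolve..."
  sleep 2
done
# Wait until at least one Galera node accepts connections
until nc -z -w 2 dbcluster 3306; do
  echo "waiting for a Galera node on port 3306..."
  sleep 2
done
exec maxscale -d    # hand over to the real MaxScale process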

till avatar Jan 23 '20 16:01 till

Hi! Similar issue: toughiq/maxscale gives the error ERROR 1045 (28000): failed to create new session if one of the toughiq/mariadb-cluster Swarm nodes is recreated.

Docker version 19.03.12, build 48a66213fe

4n70w4 avatar Oct 08 '20 14:10 4n70w4

I had a similar issue in Swarm mode when I scaled the db container up and down. Even after I scaled the containers back up, with all containers in running status, I always got the error ERROR 1045 (28000): failed to create new session. The output of maxadmin -pmariadb list servers shows all the nodes as down as well, even though they are running in Docker. I checked galera.cnf: the wsrep_cluster_address is not updated with the latest nodes' IP addresses, which means the newly created nodes won't find the existing ones. I also found that the Galera service and the splitter listeners are all down, and I can't find a way to manually restart the service and the listeners. Any solution until now?
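
For diagnosing this, it may help to compare what Galera itself reports against the MaxScale view (a sketch; the credentials and the IPs in the gcomm address are illustrative):

# Inside one of the running db containers: ask Galera for its own view
mysql -uroot -p -e "SHOW STATUS LIKE 'wsrep_cluster_size';"
mysql -uroot -p -e "SHOW STATUS LIKE 'wsrep_local_state_comment';"

# galera.cnf: if the address list here holds stale IPs, recreated nodes
# cannot join the existing cluster, e.g.:
# wsrep_cluster_address = gcomm://10.0.0.5,10.0.0.6,10.0.0.7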

gonzalloe avatar Jul 26 '22 09:07 gonzalloe