linstor-server
linstor-server copied to clipboard
Satellite has established a connection to a different controller
I've seen a strange behaviour on my test cluster recently. At random times, Controller refuses to connect to a Satellite claiming that the Satellite has established a connection to a different controller. Controller is running on 3 nodes and it's managed by drbd-reactor.
Restarting the affected Satellite node has no effect, but restarting the Controller "fixes" the problem until the next time when a random Satellite will be affected by the same issue. To me, this does not seem to be an issue on the Satellite(s) but rather on the Controller side, but I may be wrong.
Could this be a bug or something specific to this system ?
┊ Node ┊ NodeType ┊ Addresses ┊ State ┊
╞════════════════════════════════════════════════════════════════════════╡
┊ pve1 ┊ SATELLITE ┊ 10.10.13.1:3366 (PLAIN) ┊ Online ┊
┊ pve2 ┊ SATELLITE ┊ 10.10.13.2:3366 (PLAIN) ┊ Online ┊
┊ pve3 ┊ SATELLITE ┊ 10.10.13.3:3366 (PLAIN) ┊ Online ┊
┊ pve4 ┊ SATELLITE ┊ 10.10.13.4:3366 (PLAIN) ┊ OFFLINE(OTHER_CONTROLLER) ┊
┊ pve5 ┊ SATELLITE ┊ 10.10.13.5:3366 (PLAIN) ┊ Online ┊
╰───────────────────────────────────────────────────
Controller
—————————-
-- Logs begin at Tue 2021-11-16 17:31:25 UTC, end at Thu 2023-04-27 16:23:11 UTC. --
Apr 26 06:39:17 linctrl02 systemd[1]: Starting drbd-reactor controlled linstor-controller...
Apr 26 06:39:21 linctrl02 Controller[16524]: LINSTOR, Module Controller
Apr 26 06:39:22 linctrl02 Controller[16524]: Version: 1.22.0 (0dd2e7d8bad7b0115e924ef66371568320898285)
Apr 26 06:39:22 linctrl02 Controller[16524]: Build time: 2023-04-17T12:43:16+00:00
Apr 26 06:39:22 linctrl02 Controller[16524]: Java Version: 11
Apr 26 06:39:22 linctrl02 Controller[16524]: Java VM: Ubuntu, Version 11.0.18+10-post-Ubuntu-0ubuntu120.04.1
Apr 26 06:39:22 linctrl02 Controller[16524]: Operating system: Linux, Version 5.4.0-146-generic
Apr 26 06:39:22 linctrl02 Controller[16524]: Environment: amd64, 2 processors, 121 MiB memory reserved for allocations
Apr 26 06:39:22 linctrl02 Controller[16524]: System components initialization in progress
Apr 26 06:39:24 linctrl02 Controller[16524]: Loading configuration file "/etc/linstor/linstor.toml"
Apr 26 06:39:27 linctrl02 Controller[16524]: 06:39:27.914 [main] INFO LINSTOR/Controller - SYSTEM - ErrorReporter DB version 1 found.
Apr 26 06:39:27 linctrl02 Controller[16524]: 06:39:27.918 [main] INFO LINSTOR/Controller - SYSTEM - Log directory set to: '/var/log/linstor-controller'
Apr 26 06:39:28 linctrl02 Controller[16524]: 06:39:28.151 [main] INFO LINSTOR/Controller - SYSTEM - Database type is SQL
Apr 26 06:39:28 linctrl02 Controller[16524]: 06:39:28.154 [Main] INFO LINSTOR/Controller - SYSTEM - Loading API classes started.
Apr 26 06:39:29 linctrl02 Controller[16524]: 06:39:29.967 [Main] INFO LINSTOR/Controller - SYSTEM - API classes loading finished: 1812ms
Apr 26 06:39:29 linctrl02 Controller[16524]: 06:39:29.968 [Main] INFO LINSTOR/Controller - SYSTEM - Dependency injection started.
Apr 26 06:39:30 linctrl02 Controller[16524]: 06:39:30.089 [Main] INFO LINSTOR/Controller - SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule"
Apr 26 06:39:30 linctrl02 Controller[16524]: 06:39:30.091 [Main] INFO LINSTOR/Controller - SYSTEM - Extension module "com.linbit.linstor.modularcrypto.FipsCryptoModule" is not installed
Apr 26 06:39:30 linctrl02 Controller[16524]: 06:39:30.092 [Main] INFO LINSTOR/Controller - SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule"
Apr 26 06:39:30 linctrl02 Controller[16524]: 06:39:30.167 [Main] INFO LINSTOR/Controller - SYSTEM - Dynamic load of extension module "com.linbit.linstor.modularcrypto.JclCryptoModule" was successful
Apr 26 06:39:30 linctrl02 Controller[16524]: 06:39:30.168 [Main] INFO LINSTOR/Controller - SYSTEM - Attempting dynamic load of extension module "com.linbit.linstor.spacetracking.ControllerSpaceTrackingModule"
Apr 26 06:39:30 linctrl02 Controller[16524]: 06:39:30.169 [Main] INFO LINSTOR/Controller - SYSTEM - Extension module "com.linbit.linstor.spacetracking.ControllerSpaceTrackingModule" is not installed
Apr 26 06:39:34 linctrl02 Controller[16524]: 06:39:34.010 [Main] INFO LINSTOR/Controller - SYSTEM - Dependency injection finished: 4042ms
Apr 26 06:39:34 linctrl02 Controller[16524]: 06:39:34.011 [Main] INFO LINSTOR/Controller - SYSTEM - Cryptography provider: Using default cryptography module
Apr 26 06:39:35 linctrl02 Controller[16524]: 06:39:35.143 [Main] INFO LINSTOR/Controller - SYSTEM - Initializing authentication subsystem
Apr 26 06:39:36 linctrl02 Controller[16524]: 06:39:36.890 [Main] INFO LINSTOR/Controller - SYSTEM - Starting service instance 'TimerEventService' of type TimerEventService
Apr 26 06:39:36 linctrl02 Controller[16524]: 06:39:36.898 [Main] INFO LINSTOR/Controller - SYSTEM - Initializing the database connection pool
Apr 26 06:39:36 linctrl02 Controller[16524]: 06:39:36.911 [Main] INFO LINSTOR/Controller - SYSTEM - SQL database connection URL is "jdbc:h2:/var/lib/linstor/linstordb"
Apr 26 06:39:37 linctrl02 Controller[16524]: 06:39:37.287 [Main] INFO LINSTOR/Controller - SYSTEM - SQL database is H2
Apr 26 06:39:37 linctrl02 Controller[16524]: 06:39:37.331 [Main] WARN org.flywaydb.core.api.Location - Use of dots (.) as path separators will be deprecated in Flyway 7. Path: com.linbit.linstor.dbcp.migration
Apr 26 06:39:37 linctrl02 Controller[16524]: 06:39:37.354 [Main] INFO org.flywaydb.core.internal.license.VersionPrinter - Flyway Community Edition 6.5.7 by Redgate
Apr 26 06:39:37 linctrl02 Controller[16524]: 06:39:37.525 [Main] INFO org.flywaydb.core.internal.database.DatabaseFactory - Database: jdbc:h2:/var/lib/linstor/linstordb (H2 1.4)
Apr 26 06:39:37 linctrl02 Controller[16524]: 06:39:37.785 [Main] INFO org.flywaydb.core.internal.command.DbValidate - Successfully validated 69 migrations (execution time 00:00.168s)
Apr 26 06:39:37 linctrl02 Controller[16524]: 06:39:37.905 [Main] INFO org.flywaydb.core.internal.command.DbMigrate - Current version of schema "LINSTOR": 2023.03.22.10.00
Apr 26 06:39:37 linctrl02 Controller[16524]: 06:39:37.908 [Main] WARN org.flywaydb.core.internal.command.DbMigrate - outOfOrder mode is active. Migration of schema "LINSTOR" may not be reproducible.
Apr 26 06:39:37 linctrl02 Controller[16524]: 06:39:37.953 [Main] INFO org.flywaydb.core.internal.command.DbMigrate - Schema "LINSTOR" is up to date. No migration necessary.
Apr 26 06:39:37 linctrl02 Controller[16524]: 06:39:37.968 [Main] INFO LINSTOR/Controller - SYSTEM - Starting service instance 'DatabaseService' of type DatabaseService
Apr 26 06:39:37 linctrl02 Controller[16524]: 06:39:37.968 [Main] INFO LINSTOR/Controller - SYSTEM - Loading security objects
Apr 26 06:39:38 linctrl02 Controller[16524]: 06:39:38.015 [Main] INFO LINSTOR/Controller - SYSTEM - Current security level is NO_SECURITY
Apr 26 06:39:38 linctrl02 Controller[16524]: 06:39:38.064 [Main] INFO LINSTOR/Controller - SYSTEM - Core objects load from database is in progress
Apr 26 06:39:38 linctrl02 Controller[16524]: 06:39:38.995 [Main] INFO LINSTOR/Controller - SYSTEM - Core objects load from database completed
Apr 26 06:39:39 linctrl02 Controller[16524]: 06:39:39.004 [Main] INFO LINSTOR/Controller - SYSTEM - Starting service instance 'TaskScheduleService' of type TaskScheduleService
Apr 26 06:39:39 linctrl02 Controller[16524]: 06:39:39.007 [Main] INFO LINSTOR/Controller - SYSTEM - Initializing network communications services
Apr 26 06:39:39 linctrl02 Controller[16524]: 06:39:39.010 [Main] WARN LINSTOR/Controller - SYSTEM - The SSL network communication service 'DebugSslConnector' could not be started because the keyStore file (/etc/linstor/ssl/keystore.jks) is missing
Apr 26 06:39:39 linctrl02 Controller[16524]: 06:39:39.088 [Main] INFO LINSTOR/Controller - SYSTEM - Created network communication service 'PlainConnector'
Apr 26 06:39:39 linctrl02 Controller[16524]: 06:39:39.089 [Main] WARN LINSTOR/Controller - SYSTEM - The SSL network communication service 'SslConnector' could not be started because the keyStore file (/etc/linstor/ssl/keystore.jks) is missing
Apr 26 06:39:39 linctrl02 Controller[16524]: 06:39:39.102 [Main] INFO LINSTOR/Controller - SYSTEM - Created network communication service 'SslConnector'
Apr 26 06:39:39 linctrl02 Controller[16524]: 06:39:39.105 [Main] INFO LINSTOR/Controller - SYSTEM - Reconnecting to previously known nodes
Apr 26 06:39:39 linctrl02 Controller[16524]: 06:39:39.282 [Main] INFO LINSTOR/Controller - SYSTEM - Reconnect requests sent
Apr 26 06:39:39 linctrl02 Controller[16524]: 06:39:39.289 [Main] INFO LINSTOR/Controller - SYSTEM - Starting service instance 'ScheduleBackupService' of type ScheduleBackupService
Apr 26 06:39:39 linctrl02 Controller[16524]: 06:39:39.295 [Main] INFO LINSTOR/Controller - SYSTEM - Starting service instance 'EbsStatusPoll' of type EbsStatusPoll
Apr 26 06:39:41 linctrl02 Controller[16524]: WARNING: An illegal reflective access operation has occurred
Apr 26 06:39:41 linctrl02 Controller[16524]: WARNING: Illegal reflective access by com.sun.xml.bind.v2.runtime.reflect.opt.Injector (file:/usr/share/linstor-server/lib/jaxb-impl-2.2.11.jar) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int)
Apr 26 06:39:41 linctrl02 Controller[16524]: WARNING: Please consider reporting this to the maintainers of com.sun.xml.bind.v2.runtime.reflect.opt.Injector
Apr 26 06:39:41 linctrl02 Controller[16524]: WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
Apr 26 06:39:41 linctrl02 Controller[16524]: WARNING: All illegal access operations will be denied in a future release
Apr 26 06:39:44 linctrl02 Controller[16524]: Apr 26, 2023 6:39:44 AM org.glassfish.grizzly.http.server.NetworkListener start
Apr 26 06:39:44 linctrl02 Controller[16524]: INFO: Started listener bound to [[::]:3370]
Apr 26 06:39:44 linctrl02 Controller[16524]: Apr 26, 2023 6:39:44 AM org.glassfish.grizzly.http.server.HttpServer start
Apr 26 06:39:44 linctrl02 Controller[16524]: INFO: [HttpServer] Started.
Apr 26 06:39:44 linctrl02 Controller[16524]: 06:39:44.352 [Main] INFO LINSTOR/Controller - SYSTEM - Controller initialized
Apr 26 06:39:44 linctrl02 Controller[16524]: 06:39:44.646 [TaskScheduleService] INFO LINSTOR/Controller - SYSTEM - LogArchive: Running log archive on directory: /var/log/linstor-controller
Apr 26 06:39:44 linctrl02 Controller[16524]: 06:39:44.787 [TaskScheduleService] INFO LINSTOR/Controller - SYSTEM - LogArchive: No logs to archive.
Apr 26 06:39:45 linctrl02 systemd[1]: Started drbd-reactor controlled linstor-controller.
Apr 27 06:23:36 linctrl02 Controller[16524]: 06:23:36.413 [MainWorkerPool-2] WARN LINSTOR/Controller - SYSTEM - Satellite pve4 has established a connection to a different controller
Apr 27 06:39:39 linctrl02 Controller[16524]: 06:39:39.729 [TaskScheduleService] INFO LINSTOR/Controller - SYSTEM - LogArchive: Running log archive on directory: /var/log/linstor-controller
Apr 27 06:39:40 linctrl02 Controller[16524]: 06:39:40.720 [TaskScheduleService] INFO LINSTOR/Controller - SYSTEM - LogArchive: No logs to archive.
Apr 27 14:00:15 linctrl02 Controller[16524]: 14:00:15.196 [grizzly-http-server-2] ERROR LINSTOR/Controller - SYSTEM - No active connection to satellite 'pve4'. [Report number 6448C71C-00000-000000]
Apr 27 16:00:19 linctrl02 Controller[16524]: 16:00:19.655 [grizzly-http-server-3] ERROR LINSTOR/Controller - SYSTEM - No active connection to satellite 'pve4'. [Report number 6448C71C-00000-000001]
Satellite
———————
Apr 27 17:44:54 pve4 Satellite[1624881]: 17:44:54.243 [Thread-32] INFO LINSTOR/Satellite - SYSTEM - Shutdown complete
Apr 27 17:44:56 pve4 systemd[1]: linstor-satellite.service: Succeeded.
Apr 27 17:44:56 pve4 systemd[1]: Stopped LINSTOR Satellite Service.
Apr 27 17:44:56 pve4 systemd[1]: linstor-satellite.service: Consumed 5h 20min 24.288s CPU time.
Apr 27 17:44:56 pve4 systemd[1]: Started LINSTOR Satellite Service.
Apr 27 17:45:06 pve4 Satellite[1051398]: LINSTOR, Module Satellite
Apr 27 17:45:11 pve4 Satellite[1051398]: Version: 1.22.0 (0dd2e7d8bad7b0115e924ef66371568320898285)
Apr 27 17:45:11 pve4 Satellite[1051398]: Build time: 2023-04-17T11:24:18+00:00
Apr 27 17:45:11 pve4 Satellite[1051398]: Java Version: 11
Apr 27 17:45:11 pve4 Satellite[1051398]: Java VM: Debian, Version 11.0.18+10-post-Debian-1deb11u1
Apr 27 17:45:11 pve4 Satellite[1051398]: Operating system: Linux, Version 5.15.35-1-pve
Apr 27 17:45:11 pve4 Satellite[1051398]: Environment: amd64, 8 processors, 1992 MiB memory reserved for allocations
Apr 27 17:45:11 pve4 Satellite[1051398]: System components initialization in progress
Apr 27 17:45:35 pve4 Satellite[1051398]: 17:45:35.517 [main] INFO LINSTOR/Satellite - SYSTEM - ErrorReporter DB version 1 found.
Apr 27 17:45:35 pve4 Satellite[1051398]: 17:45:35.520 [main] INFO LINSTOR/Satellite - SYSTEM - Log directory set to: '/var/log/linstor-satelli
te'
Apr 27 17:45:36 pve4 Satellite[1051398]: 17:45:36.961 [Main] INFO LINSTOR/Satellite - SYSTEM - Loading API classes started.
Apr 27 17:45:38 pve4 Satellite[1051398]: 17:45:38.935 [Main] INFO LINSTOR/Satellite - SYSTEM - API classes loading finished: 1974ms
Apr 27 17:45:38 pve4 Satellite[1051398]: 17:45:38.936 [Main] INFO LINSTOR/Satellite - SYSTEM - Dependency injection started.
Apr 27 17:45:39 pve4 Satellite[1051398]: 17:45:39.434 [Main] INFO LINSTOR/Satellite - SYSTEM - Attempting dynamic load of extension module "co
m.linbit.linstor.modularcrypto.FipsCryptoModule"
Apr 27 17:45:39 pve4 Satellite[1051398]: 17:45:39.435 [Main] INFO LINSTOR/Satellite - SYSTEM - Extension module "com.linbit.linstor.modularcry
pto.FipsCryptoModule" is not installed
Apr 27 17:45:39 pve4 Satellite[1051398]: 17:45:39.435 [Main] INFO LINSTOR/Satellite - SYSTEM - Attempting dynamic load of extension module "co
m.linbit.linstor.modularcrypto.JclCryptoModule"
Apr 27 17:45:40 pve4 Satellite[1051398]: 17:45:40.123 [Main] INFO LINSTOR/Satellite - SYSTEM - Dynamic load of extension module "com.linbit.li
nstor.modularcrypto.JclCryptoModule" was successful
Apr 27 17:45:49 pve4 Satellite[1051398]: 17:45:49.721 [Main] INFO LINSTOR/Satellite - SYSTEM - Dependency injection finished: 10784ms
Apr 27 17:45:49 pve4 Satellite[1051398]: 17:45:49.722 [Main] INFO LINSTOR/Satellite - SYSTEM - Cryptography provider: Using default cryptograp
hy module
Apr 27 17:46:20 pve4 Satellite[1051398]: 17:46:20.203 [Main] INFO LINSTOR/Satellite - SYSTEM - Initializing main network communications servic
e
Apr 27 17:46:20 pve4 Satellite[1051398]: 17:46:20.203 [Main] INFO LINSTOR/Satellite - SYSTEM - Starting service instance 'TimerEventService' o
f type TimerEventService
Apr 27 17:46:20 pve4 Satellite[1051398]: 17:46:20.204 [Main] INFO LINSTOR/Satellite - SYSTEM - Starting service instance 'FileEventService' of
type FileEventService
Apr 27 17:46:20 pve4 Satellite[1051398]: 17:46:20.204 [Main] INFO LINSTOR/Satellite - SYSTEM - Starting service instance 'DrbdEventService-1'
of type DrbdEventService
Apr 27 17:46:20 pve4 Satellite[1051398]: 17:46:20.207 [Main] INFO LINSTOR/Satellite - SYSTEM - Starting service instance 'DrbdEventPublisher-1
' of type DrbdEventPublisher
Apr 27 17:46:20 pve4 Satellite[1051398]: 17:46:20.207 [Main] INFO LINSTOR/Satellite - SYSTEM - Starting service instance 'SnapshotShippingServ
ice' of type SnapshotShippingService
Apr 27 17:46:20 pve4 Satellite[1051398]: 17:46:20.207 [Main] INFO LINSTOR/Satellite - SYSTEM - Starting service instance 'BackupShippingS3Serv
ice' of type BackupShippingS3Service
Apr 27 17:46:20 pve4 Satellite[1051398]: 17:46:20.208 [Main] INFO LINSTOR/Satellite - SYSTEM - Starting service instance 'BackupShippingL2LSer
vice' of type BackupShippingL2LService
Apr 27 17:46:20 pve4 Satellite[1051398]: 17:46:20.208 [Main] INFO LINSTOR/Satellite - SYSTEM - Starting service instance 'DeviceManager' of ty
pe DeviceManager
Apr 27 17:46:20 pve4 Satellite[1051398]: 17:46:20.209 [Main] INFO LINSTOR/Satellite - SYSTEM - Starting service instance 'CloneService' of typ
e CloneService
Apr 27 17:46:20 pve4 Satellite[1051398]: 17:46:20.965 [Main] INFO LINSTOR/Satellite - SYSTEM - NetComService started on port /0:0:0:0:0:0:0:0:
3366
Package versions:
Satellite
---------
ii linstor-client 1.18.0-1 all Linstor client command line tool
ii linstor-common 1.22.0-1 all DRBD distributed resource management utility
ii linstor-proxmox 7.0.0-1 all DRBD distributed resource management utility
ii linstor-satellite 1.22.0-1 all DRBD distributed resource management utility
ii python-linstor 1.18.0-1 all Linstor python api library
cat /proc/drbd
version: 9.2.3 (api:2/proto:86-122
Controller
----------
ii linstor-client 1.18.0-1ppa1~focal1 all Linstor client command line tool
ii linstor-common 1.22.0-1ppa1~focal1 all DRBD distributed resource management utility
ii linstor-controller 1.22.0-1ppa1~focal1 all DRBD distributed resource management utility
ii python-linstor 1.18.0-1ppa1~focal1 all Linstor python api library
cat /proc/drbd
version: 9.2.3 (api:2/proto:86-122)
I also do facing same problem. With two different controller on two different servers are installed, satellites connect to another controller if the main controller is restarting. But this is not primary-secondary controller setup. It is like satellites do a sort of dhcp to find a controller to connect.
There should only be one active(started) controller at a time. It is not supported having multiple controller running at the same time using the same satellite nodes.
yes Mimikoo's setup looks just wrong, but acidrop uses drbd-reactor, there shouldn't be two instances of the controller active at a time.
some TCP timeout thingy where this races? where the old controller is gone and the new started but the satellite did not see that the old is already gone? basically why some setups use some kind of "portblock" magic?
Just realized that not satellites are connecting to controllers, but controllers are connecting to satellites. My setup was a old controller that where in running state with nodes defined. And the new controller was restarted.
There must be some configurations for satellites to define a password or some connection authentication so a malicious controller would not get nodes and break the informaion.
Looks like this is reproducible only when the nodes are under heavy load (drbd-reactor and the linstor controller in my use case is running within VMs).